Influence analysis is a fundamental problem in social network analysis and mining. Important
applications of influence analysis in social networks include influence maximization for viral
marketing, finding the most influential nodes, online advertising, etc. For many of these
applications, it is crucial to evaluate the influenceability of a node. In this paper, we study the
problem of evaluating the influenceability of nodes in a social network based on a widely used
influence spread model, namely the independent cascade model. Since this problem is #P-complete,
most existing work is based on Naive Monte-Carlo (NMC) sampling. However, the NMC estimator
typically has a large variance, which significantly reduces its effectiveness. To overcome this
problem, we propose two families of new estimators based on the idea of stratified sampling. We
first present two basic stratified sampling (BSS) estimators, namely the BSS-I estimator and the
BSS-II estimator, which partition the entire population into $2^r$ and $r + 1$ strata,
respectively, by choosing r edges. Second, to further reduce the variance, we observe that both the
BSS-I and BSS-II estimators can be applied recursively on each stratum, and we therefore propose
two recursive stratified sampling (RSS) estimators, namely the RSS-I estimator and the RSS-II
estimator. Theoretically, all of our estimators are shown to be unbiased, and their variances are
significantly smaller than the variance of the NMC estimator. Finally, our extensive experimental
results on both synthetic and real datasets demonstrate the efficiency and accuracy of our new
estimators.

1. INTRODUCTION

Large-scale online social networks (OSNs) such as Facebook and Twitter have become increasingly
popular in recent years. Users in OSNs are able to share thoughts, activities, photos, and other
information with their friends. As a result, OSNs have become an important medium for information
dissemination and influence spread. A fundamental problem in such OSNs is to analyze and study the
social influence among users [Tang et al. 2009]. Important applications of influence analysis in
OSNs include influence maximization for viral marketing [Kempe et al. 2003; Chen et al. 2010],
finding the most influential nodes [Liu et al. 2009; Lappas et al. 2010], online advertising, etc.
In particular, the influence maximization problem has recently attracted tremendous attention in
the research community [Leskovec et al. 2007; Chen et al. 2009; Chen et al. 2010; Goyal et al.
2011]. For many of these applications, a very important step is to accurately evaluate the
influenceability of a node in OSNs.

The influenceability evaluation problem is based on influence spread in a network. Generally, the
influence spread in a network can be modeled as a stochastic cascade model. In the literature, a
widely used cascade model is the independent cascade (IC) model. In the IC model, each node i has a
single chance to influence his/her neighbor j with a probability $p_{ij}$, and such an "influence
event" is independent of the other "influence events" over other nodes. Due to this independence
property, the IC model can be represented by the probabilistic graph model, where each edge in the
graph is associated with a probability and the existence of an edge is independent of any other
edges [Potamias et al. 2010]. In this paper, we focus on the IC model and assume that the influence
probabilities of all the edges in a social network are given in advance (see footnote 1). In
addition, we use the probabilistic graph model to represent the IC model.

Given the IC model and a seed node s, the influenceability evaluation problem is to compute the
expected influence spread of the seed node s. This problem is equivalent to calculating the
expected number of nodes in a probabilistic graph $\mathcal{G}$ that are reachable from s, which is
known to be #P-complete [Chen et al. 2010]. As a result, there is no hope of exactly evaluating the
influenceability in polynomial time unless P = #P. The existing algorithm for this problem is based
on Naive Monte-Carlo sampling [Kempe et al. 2003; Kempe et al. 2005; Chen et al. 2009]. As our
analysis in Section 3 shows, the Naive Monte-Carlo (NMC) estimator has a large variance, which
significantly reduces its effectiveness. Theoretically, the NMC estimator can achieve an
arbitrarily close approximation to the exact value of the influenceability. However, this requires
a large number of samples. Since drawing a single Monte-Carlo sample requires flipping m coins to
determine all the m edges of the network, the NMC estimator is extremely expensive when a
meaningful approximation of the influenceability is needed in large networks. Consequently, the key
to accelerating the NMC estimator is to reduce the number of samples that are needed to achieve
good accuracy.

In order to reduce the number of samples used in the NMC estimator, one potential solution is to
reduce its variance. In this paper, we propose two types of Monte-Carlo estimators, namely type-I
estimators and type-II estimators, based on the idea of stratified sampling. All of our proposed
estimators are shown to be unbiased, and their variances are significantly smaller than the
variance of the NMC estimator. To the best of our knowledge, this is the first work that addresses
and studies the variance problem of NMC for the influenceability evaluation problem.

To develop new type-I estimators, we devise an exact divide-and-conquer enumeration algorithm. Our
exact algorithm starts by enumerating r edges, thus resulting in $2^r$ cases. Then, for each case
the algorithm recursively enumerates another r edges. The recursion terminates after all the m
edges are enumerated. This exact algorithm takes exponential time to evaluate a node's
influenceability. Based on the exact algorithm, we propose a basic stratified sampling (BSS)
estimator, namely the BSS-I estimator, to estimate a node's influenceability. In particular, we
first select r edges and determine their statuses (existence or inexistence). This process
generates $2^r$ cases. Then, we let each case be a stratum, and draw samples separately from each
stratum. By carefully allocating the sample size for each stratum, we prove that the variance of
the BSS-I estimator is smaller than the variance of the NMC estimator. Interestingly, we find that
our BSS-I estimator can be recursively performed in each stratum, and thereby we propose a
recursive stratified sampling estimator, namely the RSS-I estimator. Since the RSS-I estimator
recursively reduces the variance in each stratum, its variance is significantly smaller than the
variance of the BSS-I estimator. It is important to note that both the BSS-I and RSS-I estimators
have the same time complexity as the NMC estimator.

Footnote 1: Learning the influence probabilities is out of the scope of this paper. In the
literature, there are some studies, such as [Goyal et al. 2010], on learning the influence
probabilities in social networks.

In addition to the type-I estimators (BSS-I and RSS-I), we further develop two type-II estimators
based on a new stratification method. The new stratification method partitions the population into
$r + 1$ strata by picking r edges. In the first stratum, which is denoted by stratum 0, we set the
statuses of all the r edges to "0", which denotes edge inexistence. In the i-th ($1 \le i \le r$)
stratum, we set the statuses of the first $i - 1$ edges to "0", the i-th edge to "1", which
signifies edge existence, and the remaining $r - i$ edges to "*", which denotes that the status of
the edge is yet to be determined. Based on this stratification approach, we propose a basic
stratified sampling estimator, namely the BSS-II estimator. Similar to the idea of the RSS-I
estimator, we develop a recursive stratified sampling estimator based on the BSS-II estimator,
namely the RSS-II estimator. We conduct extensive experimental studies on both synthetic and real
datasets, and we show that both the RSS-I and RSS-II estimators reduce the variance of the NMC
estimator significantly.

Note that the stratification approaches in both type-I and type-II estimators are based
on the r selected edges. Thus, the edge-selection strategy may significantly affect the
performance of the estimators. In this paper, we present two edge-selection strategies 
for the proposed estimators: random edge-selection and Breadth-First-Search (BFS) 
edge-selection. The random edge-selection is to pick r unsampled edges randomly for 
stratification, while the BFS edge-selection picks r unsampled edges according to their 
BFS visiting order (the BFS starts from the seed node s). In our experiments, we show 
that an estimator with the BFS edge-selection strategy significantly outperforms the 
same estimator with the random edge-selection strategy. 

Besides the influenceability estimation problem in social networks, our proposed estimation methods
can be applied in many other application domains. For example, consider an application in a
communication network with link failures. Given a router s, one needs to compute the expected
number of hosts in the network that are reachable from s. Such a count assists network resource
planning, and is also useful for network resource estimation, for example in P2P networks. Our
proposed algorithms can provide accurate estimators for such application domains. In addition, our
influenceability estimation methods can be directly applied to the so-called influence function
evaluation problem [Kempe et al. 2003], in which the seed is not a single node but a set of nodes.
We can solve this problem by adding a virtual node s and linking it to the set of seed nodes.
Finally, our proposed stratified sampling estimators are very general, and can easily be used to
handle uncertain graph mining problems, such as network reliability estimation [Rubino 1999],
shortest path [Potamias et al. 2010], and the reachability computation problem [Jin et al. 2011b].

The rest of this paper is organized as follows. We give the problem statement in Section 2 and
introduce the Naive Monte-Carlo estimator in Section 3. We propose the type-I and type-II
estimators in Section 4 and Section 5, respectively. Extensive experimental studies are reported in
Section 6. Section 7 discusses the related work and Section 8 concludes this work.

2. PROBLEM STATEMENT 

We consider a social network $G = (V, E)$, where V denotes a set of nodes and E denotes a set of
directed edges between the nodes. Let $n = |V|$ and $m = |E|$ be the number of nodes and edges in
G, respectively. In a social network, users (nodes) can perform actions, and the actions can
propagate over the network. For example, in Twitter, an action denotes that a user posts a tweet,
and the action propagation denotes the event that the same tweet is re-posted (retweeted) by
his/her followers. In this paper, we adopt the independent cascade (IC) model [Kempe et al. 2003;
Kempe et al. 2005] to model such an action propagation process. In the IC model, every edge (u, v)
is associated with an influence probability $p_{uv}$ (Fig. 1(a)), which represents the probability
that node v performs an action after the same action has been taken by its adjacent node u. We
refer to a social network G with influence probabilities as an influence network, denoted by
$\mathcal{G} = (V, E, P)$, where P represents the set of influence probabilities. We call a node an
active node if it performs an action.

The propagation process of the IC model unfolds in discrete steps. More precisely, we assume that a
node v follows a node u, and that at step t node u performs an action a and node v does not. Then,
node u is given a single chance to influence node v, and it succeeds with probability $p_{uv}$.
This probability is independent of other nodes that attempt to influence node v. If node u
succeeds, then node v will perform action a at step $t + 1$. In other words, node v is influenced
by node u at step $t + 1$. It is important to note that whether u succeeds or not, it cannot make
any further attempts to influence v. The process terminates when no new node can be influenced.
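The propagation rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the function name `simulate_ic` and the edge format (a list of `(u, v, p_uv)` tuples) are assumptions made for the example.

```python
import random
from collections import deque

def simulate_ic(edges, seed, rng):
    """Run one independent-cascade propagation and return the set of active nodes.

    edges: list of (u, v, p_uv) tuples (hypothetical input format).
    """
    out = {}
    for u, v, p in edges:
        out.setdefault(u, []).append((v, p))
    active = {seed}
    frontier = deque([seed])  # FIFO order mirrors the discrete propagation steps
    while frontier:
        u = frontier.popleft()
        for v, p in out.get(u, []):
            # u is processed exactly once, so it gets a single attempt on each inactive v
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return active
```

Because each node enters the queue at most once, every edge is tried at most once, which matches the single-chance rule of the IC model.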

The IC model can be initiated by a single node s such that the node performs an action before any
other node in $V \setminus \{s\}$. The seed node s models the source of influence, and its
influence can spread across the network following the IC model. The propagation process is a
stochastic process, so after the process terminates the number of active nodes is a random
variable. Therefore, we take the expectation of this random variable to measure the influence
spread of s, and it is denoted as $F_s(\mathcal{G})$. We refer to the expected influence spread of
s (i.e., $F_s(\mathcal{G})$) as the influenceability of node s. In this paper, we aim to evaluate
the influenceability $F_s(\mathcal{G})$ given a seed node s. In the following subsection, we give a
formal definition of $F_s(\mathcal{G})$ based on the probabilistic graph model [Potamias et al.
2010].

2.1. Influenceability Evaluation 

Based on the IC model, an influence network $\mathcal{G} = (V, E, P)$ is represented by the
probabilistic graph model [Potamias et al. 2010], where the existence of an edge is independent of
any other edges.

Given an influence network $\mathcal{G} = (V, E, P)$, we denote by $G_P = (V_P, E_P)$ a possible
graph, which can be obtained by sampling each edge e in $\mathcal{G}$ according to the influence
probability $p_e$ associated with the edge e ($p_e \in P$). Here, we have $V_P = V$,
$E_P \subseteq E$, and the possible graph $G_P$ has the probability $\Pr[G_P]$, which is given by

$$\Pr[G_P] = \prod_{e \in E_P} p_e \prod_{e \in E \setminus E_P} (1 - p_e). \tag{1}$$

The total number of such possible graphs is $2^m$, where m is the number of edges in $\mathcal{G}$.
For example, in Fig. 1(a) the influence network $\mathcal{G}$ has $2^{10}$ possible graphs, and the
possible graphs $G_1$ (Fig. 1(c)) and $G_2$ (Fig. 1(d)) have probability $\Pr[G_1] = 0.000007056$
and $\Pr[G_2] = 0.00003704$, respectively.

According to the IC model, given a seed node s, the influenceability of s, denoted by
$F_s(\mathcal{G})$, is the expected influence spread over all the possible graphs of $\mathcal{G}$.
Therefore, based on the probabilistic graph model, the influenceability $F_s(\mathcal{G})$ can be
given by

$$F_s(\mathcal{G}) = \sum_{G_P \in \Omega} \Pr[G_P] \, f_s(G_P), \tag{2}$$

where $\Omega$ denotes the set of all possible graphs of $\mathcal{G}$, and $f_s(G_P)$ is the number
of nodes that are reachable from the seed node s in the possible graph $G_P$. Note that $f_s(G_P)$
is a random variable and its expectation is $F_s(\mathcal{G})$, i.e.,
$F_s(\mathcal{G}) = E[f_s(G_P)]$.

[Figure 1, surviving panel captions: (c) Possible graph $G_1$ with probability 0.000007056, and
$f_s(G_1) = 3$; (d) Possible graph $G_2$ with probability 0.00003704, and $f_s(G_2) = 5$.]

Fig. 1. A Simple Influence Network

As an example, consider the source node $s = v_5$ in the influence network $\mathcal{G}$ in
Fig. 1(a). $F_s(\mathcal{G})$ can be computed by enumerating all $2^{10}$ possible graphs $G_P$ and
computing the corresponding $\Pr[G_P]$ and $f_s(G_P)$. For instance, from Fig. 1(c) and Fig. 1(d)
we have $f_s(G_1) = 3$ and $f_s(G_2) = 5$. In this example, the exact $F_s(\mathcal{G})$ is
0.46123456.
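The definitions in Eq. (1) and Eq. (2) translate directly into a brute-force computation. The following is a minimal sketch under stated assumptions, not the paper's code: the names `count_reachable` and `exact_influenceability` and the `(u, v, p)` edge format are hypothetical, and $f_s$ is taken to count reachable nodes other than the seed itself. It is only feasible for very small m, since it visits all $2^m$ possible graphs.

```python
from collections import deque
from itertools import product

def count_reachable(present_edges, seed):
    """f_s(G_P): number of nodes other than the seed reachable from it (BFS)."""
    adj = {}
    for u, v in present_edges:
        adj.setdefault(u, []).append(v)
    seen, queue = {seed}, deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) - 1

def exact_influenceability(edges, seed):
    """Sum Pr[G_P] * f_s(G_P) over all 2^m possible graphs, as in Eq. (1) and Eq. (2)."""
    total = 0.0
    for statuses in product((0, 1), repeat=len(edges)):
        prob = 1.0
        present = []
        for (u, v, p), x in zip(edges, statuses):
            prob *= p if x else 1.0 - p  # Eq. (1): p_e if the edge exists, 1 - p_e otherwise
            if x:
                present.append((u, v))
        total += prob * count_reachable(present, seed)
    return total
```

For a two-edge path s → a → b with both probabilities 0.5, node a is reached with probability 0.5 and node b with probability 0.25, so the sketch returns 0.75.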

Equipped with the definition of $F_s(\mathcal{G})$, we describe the influenceability evaluation
problem as follows.

Problem Statement: Given an influence network $\mathcal{G}$ and a seed node s, the influenceability
evaluation problem is to compute the influenceability $F_s(\mathcal{G})$ (Eq. (2)).

It is important to note that the influenceability evaluation problem is known to be #P-complete
[Chen et al. 2010], even for the very special influence network where the influence probabilities
of all edges are identical. There is no hope of exactly evaluating the influenceability in
polynomial time unless P = #P. Given the hardness of this problem, our goal in this paper is to
develop an efficient and accurate approximation algorithm to evaluate $F_s(\mathcal{G})$ given a
seed node s.

An important metric for evaluating the accuracy of an approximation algorithm is the mean squared
error (MSE), which is denoted by $E[(\hat{F}_s(\mathcal{G}) - F_s(\mathcal{G}))^2]$, where
$\hat{F}_s(\mathcal{G})$ denotes an estimator of $F_s(\mathcal{G})$ produced by the approximation
algorithm. By the so-called variance-bias decomposition [Jin et al. 2011b], this metric can be
decomposed into two parts:

$$E[(\hat{F}_s(\mathcal{G}) - F_s(\mathcal{G}))^2] = Var(\hat{F}_s(\mathcal{G})) + [E(\hat{F}_s(\mathcal{G})) - F_s(\mathcal{G})]^2, \tag{3}$$

where $E(\hat{F}_s(\mathcal{G}))$ and $Var(\hat{F}_s(\mathcal{G}))$ denote the expectation and
variance of the estimator $\hat{F}_s(\mathcal{G})$, respectively. If an estimator is unbiased, then
the second term in Eq. (3) vanishes. Therefore, the variance of an unbiased estimator becomes the
only indicator for evaluating the accuracy of the estimator.

3. NAIVE MONTE-CARLO

In this section, we introduce the Naive Monte-Carlo (NMC) sampling for estimating the
influenceability $F_s(\mathcal{G})$ given a seed node s, which is the only existing algorithm used
in the influence maximization literature [Kempe et al. 2003; Leskovec et al. 2007; Chen et al.
2009; Chen et al. 2010]. This method first samples N possible graphs $G_1, G_2, \cdots, G_N$ of
$\mathcal{G}$ according to the influence probabilities P, and then calculates the number of nodes
reachable from the seed node s in each possible graph $G_i$, $i = 1, 2, \cdots, N$, i.e.,
$f_s(G_i)$. Finally, the NMC estimator $\hat{F}_{NMC}$ is given below.

$$\hat{F}_{NMC} = \frac{\sum_{i=1}^{N} f_s(G_i)}{N}. \tag{4}$$

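The NMC estimator is an unbiased estimator of $F_s(\mathcal{G})$, i.e.,
$E(\hat{F}_{NMC}) = F_s(\mathcal{G})$. The variance of the NMC estimator is given as follows.

$$Var(\hat{F}_{NMC}) = \frac{E[f_s(G_P)^2] - (E[f_s(G_P)])^2}{N} = \frac{\sum_{G_P \in \Omega} \Pr[G_P] \, f_s(G_P)^2 - F_s(\mathcal{G})^2}{N}. \tag{5}$$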

Notice that exactly computing the variance $Var(\hat{F}_{NMC})$ is extremely expensive, because we
have to enumerate all the possible graphs to determine it, which takes exponential time. In
practice, we resort to an unbiased estimator of $Var(\hat{F}_{NMC})$ to evaluate the accuracy of
the estimator $\hat{F}_{NMC}$ [Jin et al. 2011b]. In this case, an unbiased estimator of
$Var(\hat{F}_{NMC})$ is given by the following equation.

$$\widehat{Var}(\hat{F}_{NMC}) = \frac{\sum_{i=1}^{N} (f_s(G_i) - \hat{F}_{NMC})^2}{N - 1} = \frac{\sum_{i=1}^{N} f_s(G_i)^2 - N \hat{F}_{NMC}^2}{N - 1}. \tag{6}$$



According to Eq. (6), $\widehat{Var}(\hat{F}_{NMC})$ may be very large, because the value of
$f_s(G_i)$ falls into the interval $[0, n - 1]$, which may result in a variance estimate as large
as $O(n^2)$. Here, n is the number of nodes in $\mathcal{G}$. For example, assume $f_s(G_i) = 0$
for $i = 1, \cdots, N/2$ and $f_s(G_i) = n - 1$ for $i = N/2 + 1, \cdots, N$; then
$\widehat{Var}(\hat{F}_{NMC})$ equals $N(n - 1)^2 / (4(N - 1)) = O(n^2)$. Therefore, the key issue
that we address in this paper is to design more accurate estimators than the NMC estimator for
estimating the influenceability $F_s(\mathcal{G})$.

The NMC algorithm is described in Algorithm 1. The algorithm works in N iterations (lines 2-5). In
each iteration, the NMC algorithm needs to generate a possible graph by tossing m biased coins for
the m edges in $\mathcal{G}$, which takes $O(m)$ time (line 3). Then, the algorithm invokes a BFS
algorithm to calculate the number of nodes reachable from s, which again takes $O(m)$ time
(line 4). As a result, the time complexity of the NMC algorithm is $O(Nm)$.



4. NEW TYPE-I ESTIMATORS 

In this section, we first introduce an exact algorithm for computing the influenceability
$F_s(\mathcal{G})$, which will guide us in designing the new estimators. We then propose two new
estimators based on the idea of stratified sampling [Thompson 2002]. Both estimators are shown to
be unbiased, and their variances are significantly smaller than the variance of the NMC estimator.
We refer to the two estimators as the type-I estimators.

ALGORITHM 1: NMC ($\mathcal{G}$, N, s)

Input: Influence network $\mathcal{G}$, sample size N, and the seed node s.
Output: The NMC estimator $\hat{F}_{NMC}$.

1: $\hat{F}_{NMC} \leftarrow 0$;
2: for i = 1 to N do
3:     Flip m biased coins to generate a possible graph $G_i$;
4:     Compute $f_s(G_i)$ by the BFS algorithm;
5:     $\hat{F}_{NMC} \leftarrow \hat{F}_{NMC} + f_s(G_i)$;
6: $\hat{F}_{NMC} \leftarrow \hat{F}_{NMC} / N$;
7: return $\hat{F}_{NMC}$;
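Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function names and the `(u, v, p)` edge format are hypothetical, $f_s$ is taken to exclude the seed, and the second return value is the variance estimate of Eq. (6), which Algorithm 1 itself does not compute.

```python
import random
from collections import deque

def count_reachable(adj, seed):
    """f_s(G_i): number of nodes other than the seed reachable from it (BFS)."""
    seen, queue = {seed}, deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) - 1

def nmc_estimate(edges, seed, n_samples, rng):
    """Return (F_hat, var_hat) following Eq. (4) and Eq. (6); requires n_samples >= 2."""
    values = []
    for _ in range(n_samples):
        adj = {}
        for u, v, p in edges:  # flip one biased coin per edge (line 3)
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        values.append(count_reachable(adj, seed))  # BFS on the sampled graph (line 4)
    f_hat = sum(values) / n_samples
    var_hat = sum((x - f_hat) ** 2 for x in values) / (n_samples - 1)
    return f_hat, var_hat
```

Each iteration touches every edge once for the coin flips and at most once more during the BFS, which is where the $O(Nm)$ running time comes from.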



Table I. Probability space partition in the exact algorithm

4.1. An exact algorithm

We introduce an exact divide-and-conquer enumeration algorithm to evaluate the influenceability for
a given influence network $\mathcal{G} = (V, E, P)$ with n nodes and m edges. The main idea of our
exact algorithm is as follows. First, the algorithm divides the entire probability space $\Omega$
(all the possible graphs) into $2^r$ different subspaces by randomly enumerating r ($r < m$) edges
that have not been enumerated. Note that r is a small number (e.g., r = 5). In each subspace, the
exact algorithm recursively enumerates another r edges, and this process terminates once all the
edges are enumerated. The partition method of the exact algorithm is described in Table I. In
Table I, "0", "1", and "*" denote the statuses of inexistence, existence, and not-yet-enumerated
for the edges, respectively. Each case from $1, 2, \cdots,$ to $2^r$ corresponds to a subspace, and
$\Omega_i$, for $i = 1, 2, \cdots, 2^r$, denotes the probability space of the case i, which
represents the set of all possible graphs in the case i.

To clarify our algorithm, let $T = (e_1, e_2, \cdots, e_r)$ be the set of selected r edges, and
$X_i = (X_{i,1}, X_{i,2}, \cdots, X_{i,r})$ be the status vector corresponding to the selected r
edges under the case i, where $X_{i,j} = 0$ signifies that the edge $e_j$ does not exist, and
$X_{i,j} = 1$ otherwise. For example, for case 1 in Table I, the status vector is
$X_1 = (0, 0, \cdots, 0)$, which means that none of the selected r edges exists. In other words,
all the possible graphs in $\Omega_1$ exclude the edges in T. The probability that a possible graph
belongs to case i is given by

$$\pi_i = \Pr[G_P \in \Omega_i] = \prod_{e_j \in T \wedge X_{i,j} = 1} p_j \prod_{e_j \in T \wedge X_{i,j} = 0} (1 - p_j). \tag{7}$$
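As a small worked illustration of Eq. (7), the sketch below enumerates all $2^r$ status vectors for a list of selected-edge probabilities and computes each $\pi_i$. The function name `stratum_probabilities` and the dictionary return format are assumptions made for this example only.

```python
from itertools import product

def stratum_probabilities(selected_probs):
    """Map each status vector X_i in {0,1}^r to its probability pi_i as in Eq. (7)."""
    pis = {}
    for status in product((0, 1), repeat=len(selected_probs)):
        pi = 1.0
        for p, x in zip(selected_probs, status):
            pi *= p if x == 1 else 1.0 - p  # p_j if e_j exists, 1 - p_j otherwise
        pis[status] = pi
    return pis
```

For two selected edges with probabilities 0.2 and 0.5, the four cases have probabilities 0.4, 0.4, 0.1, and 0.1 for the status vectors (0,0), (0,1), (1,0), and (1,1). They sum to 1 because the cases partition $\Omega$.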

In addition, let $A_1$ be the set of edges that have been enumerated, and $A_2$ be the set of edges
that have not been enumerated, such that $A_1 \cup A_2 = E$ and $A_1 \cap A_2 = \emptyset$. Then,
the influenceability of the node s under the case i is defined as

$$F_s(\mathcal{G}(A_1, A_2, X_i)) = \sum_{G_P \in \Omega_i} f_s(G_P) \frac{\Pr[G_P]}{\pi_i}, \tag{8}$$

where $\mathcal{G}(A_1, A_2, X_i)$ denotes the set of possible graphs in the case i, i.e.,
$\Omega_i$. According to Eq. (8), $F_s(\mathcal{G}(A_1, A_2, X_i))$ denotes the expected spread
over all the possible graphs in $\Omega_i$, and $\Pr[G_P] / \pi_i$ is the probability of a possible
graph $G_P$ conditioned on it belonging to $\Omega_i$. It is worth noting that
$F_s(\mathcal{G}) = F_s(\mathcal{G}(\emptyset, E, \emptyset))$. Based on Eq. (8), we have the
following theorem.

THEOREM 4.1. Let $F_s(\mathcal{G}(A_1, A_2, X_i))$ be the influenceability of the node s under the
case i as defined in Eq. (8), and T be a set of r edges randomly selected from $A_2$. For any T, we
have $2^r$ cases, and let $Y_j$ ($j = 1, \cdots, 2^r$) be the corresponding status vector. Then, we
have

$$F_s(\mathcal{G}(A_1, A_2, X_i)) = \sum_{j=1}^{2^r} \pi_j \, F_s(\mathcal{G}(A_1 \cup T, A_2 \setminus T, [X_i, Y_j])), \tag{9}$$

where $[X_i, Y_j]$ is a new status vector generated by appending $Y_j$ to $X_i$.

Based on Theorem 4.1, we develop a recursive enumeration algorithm described in Algorithm 2.
Algorithm 2 first partitions the entire probability space $\Omega$ into $2^r$ subspaces, and then
the same procedure is recursively performed on each subspace based on Theorem 4.1 (lines 9-17 in
Algorithm 2). The algorithm terminates once all the edges are enumerated. The influenceability
$F_s(\mathcal{G})$ can be computed by invoking EXACT ($\mathcal{G}$, $\emptyset$, E, $\emptyset$, s).

ALGORITHM 2: EXACT ($\mathcal{G}$, $A_1$, $A_2$, X, s)

Input: Influence network $\mathcal{G}$, the set of edges that have been enumerated $A_1$, the set
       of edges that have not been enumerated $A_2$, the status vector X of the edges in $A_1$,
       and the seed node s.
Output: The exact value of $F_s(\mathcal{G})$.

1: if $A_2 = \emptyset$ then
2:     Compute $f_s(G(V, A_1, A_2, X))$ by the BFS algorithm;
3:     return $f_s(G(V, A_1, A_2, X))$;
4: else
5:     if $|A_2| < r$ then
6:         $l \leftarrow |A_2|$;
7:     else
8:         $l \leftarrow r$;
9:     Select l edges from $A_2$ randomly;
10:    Let T be the set of selected edges;
11:    $F \leftarrow 0$;
12:    for i = 1 to $2^l$ do
13:        Let $X_i$ be the status vector of set T under the case i;
14:        Compute $\pi_i$ by Eq. (7);
15:        Append $X_i$ to X;
16:        $\mu_i \leftarrow$ EXACT ($\mathcal{G}$, $A_1 \cup T$, $A_2 \setminus T$, X, s);
17:        $F \leftarrow F + \pi_i \mu_i$;
18:    return F;
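The recursion of Theorem 4.1 can be sketched in Python as follows. This is a simplified illustration in the spirit of Algorithm 2, not a transcription of it: the names `exact` and `count_reachable` and the `(u, v, p)` edge format are assumptions, $f_s$ is taken to exclude the seed, and the next l edges are taken in list order rather than selected randomly (the result does not depend on the order).

```python
from collections import deque
from itertools import product

def count_reachable(present_edges, seed):
    """f_s: number of nodes other than the seed reachable from it (BFS)."""
    adj = {}
    for u, v in present_edges:
        adj.setdefault(u, []).append(v)
    seen, queue = {seed}, deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) - 1

def exact(edges, seed, r=2, fixed=()):
    """Divide-and-conquer enumeration; `fixed` holds the 0/1 statuses of edges[:len(fixed)]."""
    remaining = edges[len(fixed):]
    if not remaining:  # all edges enumerated: a single possible graph is left
        present = [(u, v) for (u, v, _), x in zip(edges, fixed) if x]
        return count_reachable(present, seed)
    l = min(r, len(remaining))
    total = 0.0
    for status in product((0, 1), repeat=l):  # the 2^l cases of Theorem 4.1
        pi = 1.0
        for (_, _, p), x in zip(remaining, status):
            pi *= p if x else 1.0 - p  # Eq. (7)
        total += pi * exact(edges, seed, r, fixed + status)  # Eq. (9)
    return total
```

On the diamond graph s → a, s → b (probability 0.5 each) and a → c, b → c (probability 1), nodes a and b are each reached with probability 0.5 and c with probability 0.75, so the sketch returns 1.75 for any choice of r.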

The enumeration procedure given in Algorithm 2 can be characterized by a full $2^r$-ary tree, which
is depicted in Fig. 2. Note that, to simplify our analysis, here we assume that m is divisible by
r. In the tree, each node represents a probability space that consists of a set of possible graphs.
For example, the root node denotes the probability space that includes the set of all possible
graphs, and each leaf node denotes the probability space that includes only one possible graph.
Each internal node has $2^r$ children, and each child corresponds to a case described in Table I.
To compute $F_s(\mathcal{G})$, we need to traverse all the nodes in the tree. Because the number of
nodes in the tree is $O(2^m)$, the time complexity of Algorithm 2 is $O(2^m)$. Therefore, the exact
algorithm only works on small networks, owing to the #P-completeness of the influenceability
evaluation problem. In the following, we develop two types of efficient approximation algorithms
for evaluating the influenceability.

[Figure 2: the root node $F_s(\mathcal{G}(\emptyset, E, \emptyset))$ has $2^r$ children
$F_s(\mathcal{G}(T, E \setminus T, X_1)), \cdots, F_s(\mathcal{G}(T, E \setminus T, X_{2^r}))$.]

Fig. 2. The Enumeration Tree of the Exact Algorithm.

4.2. Basic stratified sampling estimator (I) 

As discussed in Section 3, the NMC estimator has a large variance. To reduce the variance, we
propose a new stratified sampling estimator for influenceability evaluation. We call this new
estimator the basic stratified sampling (BSS) estimator, because it serves as the basis for
designing the recursive stratified sampling (RSS) estimator, which will be described in
Section 4.3. To distinguish them from the type-II estimators, which will be introduced in
Section 5, we refer to the new estimators presented in this section as the type-I estimators.
Specifically, we refer to the type-I BSS and RSS estimators as the BSS-I and RSS-I estimators,
respectively.

Unlike the NMC sampler, which draws a sample (a possible graph) from the entire population (all the
possible graphs), stratified sampling [Thompson 2002] first divides the population into M disjoint
groups, which are called strata, and then independently picks separate samples from these groups.
Stratified sampling is a commonly used technique for reducing variance in sampling design
[Thompson 2002]. There are two key techniques in stratified sampling: stratification, which is a
process for partitioning the entire population into disjoint strata, and sample allocation, which
is a procedure to determine the sample size that needs to be drawn from each stratum. Below, we
introduce our stratification and sample allocation methods.

Stratification: Our idea of stratification is based on the exact algorithm described in the
previous subsection. First, we choose r edges and determine their statuses (0/1), where r is a
small number. Recall that this process generates $2^r$ different cases as shown in Table I, and
thereby it partitions the set of possible graphs $\Omega$ into $2^r$ subsets
$\Omega_1, \cdots, \Omega_{2^r}$. Second, we let each subset be a stratum. This is valid because
$\Omega_1, \cdots, \Omega_{2^r}$ are disjoint sets and $\Omega = \bigcup_{i=1}^{2^r} \Omega_i$, so
each case is indeed a valid stratum. It is worth mentioning that our stratification process
corresponds to the top two layers in the enumeration tree (Fig. 2): the root node denotes the
entire population, and each child represents a stratum. The stratification process is depicted in
Table II.

Table II. Stratum design of the BSS-I/RSS-I estimator

Stratum         | Statuses of the r selected edges | Remaining edges | Probability
Stratum 1       | 0 0 0 ... 0                      | * ... *         | $\pi_1$
Stratum 2       | 1 0 0 ... 0                      | * ... *         | $\pi_2$
Stratum 3       | ...                              | * ... *         | $\pi_3$
...             | ...                              | ...             | ...
Stratum $2^r$   | 1 1 1 ... 1                      | * ... *         | $\pi_{2^r}$

In our stratification approach, a question that arises is how to select the r edges for
stratification. As shown in our experiments, the edge-selection strategy for choosing r edges
significantly affects the performance of the estimator. One straightforward strategy is to randomly
pick r edges from the edge set E. We refer to this edge-selection strategy as the random
edge-selection strategy. With this strategy, the selected r edges may not contribute directly to
computing the influenceability. For example, in Fig. 1(b), for the source node $s = v_5$, assume
r = 2 and the selected edges are $\{v_1 \to v_2, v_6 \to v_2\}$. These edges make no direct
contribution to calculating the influenceability $F_s(\mathcal{G})$, which may reduce the
performance of the BSS-I estimator. To avoid this problem, we introduce another heuristic
edge-selection strategy based on the BFS visiting order of the edges. To estimate
$F_s(\mathcal{G})$, we first perform a BFS starting from the node s to obtain the first r edges
according to the BFS visiting order of the edges. Then, we use these r edges for stratification. We
refer to this edge-selection strategy as the BFS edge-selection strategy. Consider the same example
in Fig. 1(b) and assume r = 2; the first r edges are $\{v_5 \to v_3, v_5 \to v_6\}$. Then, we
partition the population into 4 strata according to the statuses of these two edges. With the BFS
edge-selection strategy, the selected edges contribute directly to calculating the
influenceability. In our experiments, we find that the performance of the BSS-I estimator with the
BFS edge-selection strategy is significantly better than the performance of the BSS-I estimator
with the random edge-selection strategy.
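The BFS edge-selection strategy can be sketched as follows. This is an illustrative sketch under assumptions: the function name `bfs_edge_selection`, the `(u, v, p)` edge format, and the interpretation of "BFS visiting order of the edges" as the order in which out-edges of dequeued nodes are scanned are all choices made for this example.

```python
from collections import deque

def bfs_edge_selection(edges, seed, r):
    """Return the first r edges in BFS visiting order from the seed.

    edges: list of (u, v, p) tuples (hypothetical input format).
    """
    out = {}
    for u, v, p in edges:
        out.setdefault(u, []).append((u, v, p))
    selected, seen, queue = [], {seed}, deque([seed])
    while queue and len(selected) < r:
        u = queue.popleft()
        for edge in out.get(u, []):
            if len(selected) == r:
                break
            selected.append(edge)  # every out-edge of a visited node counts as visited
            v = edge[1]
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return selected
```

On a small hypothetical graph with edges v1 → v2, v5 → v3, v5 → v6, and v6 → v2 and seed v5, choosing r = 2 selects v5 → v3 and v5 → v6. The edge v1 → v2 is never selected, because v1 is not reachable from the seed.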

The BSS-I estimator: Let N be the total number of samples, $N_i$ be the number of samples drawn
from the stratum i ($i = 1, 2, \cdots, 2^r$), and $G_{i,j}$ ($j = 1, 2, \cdots, N_i$) be a possible
graph sampled from the stratum i. Then, the BSS-I estimator is given as follows.

$$\hat{F}_{BSSI} = \sum_{i=1}^{2^r} \pi_i \frac{1}{N_i} \sum_{j=1}^{N_i} f_s(G_{i,j}), \tag{10}$$

where $\pi_i$ is defined in Eq. (7). The following theorem shows that $\hat{F}_{BSSI}$ is an
unbiased estimator of the influenceability $F_s(\mathcal{G})$.

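THEOREM 4.2. $F_s(\mathcal{G}) = E(\hat{F}_{BSSI})$.

PROOF. We prove it by the following equalities, where the second step uses the fact that a sample
$G_{i,j}$ from stratum i equals $G_P \in \Omega_i$ with probability $\Pr[G_P] / \pi_i$.

$$\begin{aligned}
E(\hat{F}_{BSSI}) &= E\Big(\sum_{i=1}^{2^r} \pi_i \frac{1}{N_i} \sum_{j=1}^{N_i} f_s(G_{i,j})\Big) \\
&= \sum_{i=1}^{2^r} \pi_i \sum_{G_P \in \Omega_i} \frac{\Pr[G_P]}{\pi_i} f_s(G_P) \\
&= \sum_{G_P \in \Omega} \Pr[G_P] \, f_s(G_P) \\
&= F_s(\mathcal{G}).
\end{aligned}$$

□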

Let $\sigma_i$ be the variance of the sample in the stratum i. Since the samples are independently
drawn by the basic stratified sampling algorithm, the variance of the BSS-I estimator is given by

$$Var(\hat{F}_{BSSI}) = \sum_{i=1}^{2^r} \frac{\pi_i^2 \sigma_i}{N_i}, \tag{11}$$

where $\pi_i$ is given in Eq. (7).

Sample allocation: As discussed above, the BSS-I estimator is unbiased, and the variance of the
BSS-I estimator depends on the sample sizes of all the strata, i.e., $N_i$ for
$i = 1, 2, \cdots, 2^r$. Thus, a question that arises is how to allocate the sample size for each
stratum i ($i = 1, 2, \cdots, 2^r$) to minimize the variance of the BSS-I estimator, i.e.,
$Var(\hat{F}_{BSSI})$. Formally, the sample allocation problem is formulated as follows.

$$\min \; Var(\hat{F}_{BSSI}) = \sum_{i=1}^{2^r} \frac{\pi_i^2 \sigma_i}{N_i} \quad \text{s.t.} \quad \sum_{i=1}^{2^r} N_i = N. \tag{12}$$

By applying the Lagrangian method, we can derive the optimal sample allocation as given by

$$N_i = \frac{N \pi_i \sqrt{\sigma_i}}{\sum_{j=1}^{2^r} \pi_j \sqrt{\sigma_j}}, \tag{13}$$

for $i = 1, \cdots, 2^r$. From Eq. (13), the optimal allocation needs to know the variance of the
sample in each stratum, i.e., $\sigma_i$ for $i = 1, \cdots, 2^r$. However, such variances are
unavailable in our problem. Interestingly, we find that, if the sample size of the stratum i is set
to $\pi_i N$, then the variance of the BSS-I estimator is no larger than the variance of the NMC
estimator. We have the following theorem.

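THEOREM 4.3. If $N_i = \pi_i N$, then $Var(\hat{F}_{BSSI}) \le Var(\hat{F}_{NMC})$.

PROOF. If $N_i = \pi_i N$, then we have $Var(\hat{F}_{BSSI}) = \frac{1}{N} \sum_{i=1}^{2^r} \pi_i \sigma_i$. Let $\mu_i = E(f_s(G_{i,j}))$
be the expectation of the sample in stratum $i$. By definition, we have
$\sigma_i = E(f_s(G_{i,j})^2) - \mu_i^2 = \sum_{G_P \in \Omega_i} f_s(G_P)^2 \frac{\Pr[G_P]}{\pi_i} - \mu_i^2$. Then, we have

$$Var(\hat{F}_{BSSI}) = \frac{1}{N} \sum_{i=1}^{2^r} \pi_i \Big( \sum_{G_P \in \Omega_i} f_s(G_P)^2 \frac{\Pr[G_P]}{\pi_i} - \mu_i^2 \Big)$$
$$= \frac{1}{N} \sum_{i=1}^{2^r} \Big( \sum_{G_P \in \Omega_i} f_s(G_P)^2 \Pr[G_P] - \pi_i \mu_i^2 \Big)$$
$$= \frac{1}{N} \sum_{G_P \in \Omega} \Pr[G_P] f_s(G_P)^2 - \frac{1}{N} \sum_{i=1}^{2^r} \pi_i \mu_i^2$$

Given this, we can derive the difference between $Var(\hat{F}_{BSSI})$ and the variance of the
NMC estimator given earlier, $Var(\hat{F}_{NMC})$, as follows:

$$Var(\hat{F}_{NMC}) - Var(\hat{F}_{BSSI})$$
$$= \frac{1}{N} \Big( \sum_{i=1}^{2^r} \pi_i \mu_i^2 - \big( E[f_s(G_P)] \big)^2 \Big)$$
$$= \frac{1}{N} \Big( \sum_{i=1}^{2^r} \pi_i \mu_i^2 - \Big( \sum_{G_P \in \Omega} \Pr[G_P] f_s(G_P) \Big)^2 \Big)$$
$$= \frac{1}{N} \Big( \sum_{i=1}^{2^r} \pi_i \mu_i^2 - \Big( \sum_{i=1}^{2^r} \pi_i \sum_{G_P \in \Omega_i} \frac{\Pr[G_P]}{\pi_i} f_s(G_P) \Big)^2 \Big)$$
$$= \frac{1}{N} \Big( \sum_{i=1}^{2^r} \pi_i \mu_i^2 - \Big( \sum_{i=1}^{2^r} \pi_i \mu_i \Big)^2 \Big)$$
$$\ge 0.$$

Note that in the last step $\mu_i$ can be treated as a random variable that takes the value
$\mu_i$ with probability $\pi_i$. Then, we have $\sum_{i=1}^{2^r} \pi_i \mu_i^2 = E(\mu_i^2)$ and
$(\sum_{i=1}^{2^r} \pi_i \mu_i)^2 = (E(\mu_i))^2$. Since $E(\mu_i^2) \ge (E(\mu_i))^2$, the last inequality holds. This
completes the proof. □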
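To make the comparison in Theorem 4.3 concrete, the following minimal sketch evaluates Eq. (11) under proportional allocation ($N_i = \pi_i N$) and under the optimal allocation of Eq. (13), and compares both with the NMC variance written in terms of per-stratum moments. The per-stratum values `pi`, `mu`, and `sigma` are made-up toy numbers, not results from the paper, and the allocations are left fractional (no rounding), which is an assumption of this sketch.

```python
from math import sqrt

def nmc_variance(pi, mu, sigma, n):
    """NMC variance (E[f^2] - E[f]^2) / n, expressed via per-stratum moments."""
    second_moment = sum(p * (s + m * m) for p, m, s in zip(pi, mu, sigma))
    mean = sum(p * m for p, m in zip(pi, mu))
    return (second_moment - mean * mean) / n

def stratified_variance(pi, sigma, alloc):
    """Eq. (11): sum_i pi_i^2 * sigma_i / N_i for a given allocation."""
    return sum(p * p * s / n_i for p, s, n_i in zip(pi, sigma, alloc))

def proportional_allocation(pi, n):
    """N_i = pi_i * N (the allocation used by BSS-I)."""
    return [p * n for p in pi]

def optimal_allocation(pi, sigma, n):
    """Eq. (13): N_i proportional to pi_i * sqrt(sigma_i)."""
    weights = [p * sqrt(s) for p, s in zip(pi, sigma)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Toy strata (hypothetical values chosen only for illustration).
pi = [0.1, 0.2, 0.3, 0.4]      # stratum probabilities, sum to 1
mu = [1.0, 3.0, 2.0, 6.0]      # per-stratum expectations
sigma = [0.5, 2.0, 1.0, 4.0]   # per-stratum variances
n = 1000

v_nmc = nmc_variance(pi, mu, sigma, n)
v_prop = stratified_variance(pi, sigma, proportional_allocation(pi, n))
v_opt = stratified_variance(pi, sigma, optimal_allocation(pi, sigma, n))
print(v_nmc, v_prop, v_opt)
```

In this toy setting the gap `v_nmc - v_prop` equals $\frac{1}{N}(\sum_i \pi_i \mu_i^2 - (\sum_i \pi_i \mu_i)^2)$, which is the quantity shown to be non-negative in the proof.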
ALGORITHM 3: BSS-I ($\mathcal{G}$, $N$, $s$)

Input: Influence network $\mathcal{G}$, sample size $N$, and the seed node $s$.
Output: The BSS-I estimator $\hat{F}$.

 1: $\hat{F} \leftarrow 0$;
 2: Choose $r$ edges according to an edge-selection strategy;
 3: for $i = 1$ to $2^r$ do
 4:     Let $X_i$ be the status vector of stratum $i$;
 5:     Compute $\pi_i$ by Eq. (7);
 6:     $N_i \leftarrow [\pi_i N]$;
 7:     $t \leftarrow 0$;
 8:     for $j = 1$ to $N_i$ do
 9:         Flip $m - r$ coins to determine the remaining $m - r$ edges;
10:         Let $Y_j$ be the status vector of the remaining $m - r$ edges;
11:         Append $X_i$ to $Y_j$ to generate a possible graph $G_j$;
12:         Compute $f_s(G_j)$ by the BFS algorithm;
13:         $t \leftarrow t + f_s(G_j)$;
14:     $t \leftarrow t / N_i$;
15:     $\hat{F} \leftarrow \hat{F} + \pi_i t$;
16: return $\hat{F}$;



The BSS-I algorithm: Given the stratification and sample allocation methods, we
present our basic stratified sampling algorithm in Algorithm 3. First, Algorithm 3
selects $r$ edges to partition the population into $2^r$ strata according to an edge-selection
strategy (line 2), either random or BFS edge-selection. For convenience, we refer to
the BSS-I estimator with random edge-selection and the BSS-I estimator with BFS
edge-selection as the BSS-I-RM and BSS-I-BFS estimators, respectively. Second, according
to our sample allocation method, the algorithm draws $\pi_i N$ samples from stratum
$i$ (lines 8-13). Finally, the algorithm outputs the BSS-I estimator $\hat{F}_{BSSI}$. Notice that it
takes $O(m)$ time both to generate a possible graph $G$ and to perform BFS on $G$.
Besides, the algorithm needs to draw $N$ possible graphs. Hence, the time complexity
of Algorithm 3 is $O(mN)$, which is the same as that of the NMC estimator. However,
our BSS-I estimator reduces the variance of the NMC estimator. The
advantages of the BSS-I estimator are twofold. On one hand, given the sample size, the
BSS-I estimator is more accurate than the NMC estimator as it has a smaller variance.
On the other hand, to achieve the same variance, the BSS-I estimator needs a smaller
sample size than the NMC estimator, thus reducing the running time of the
sampling process.
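The steps of Algorithm 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is a list of `(u, v, p)` triples, the first `r` edges of that list play the role of the selected edges (so the caller decides between random and BFS order), $f_s$ counts the nodes reachable from `s` excluding `s` itself, and every stratum with non-zero probability receives at least one sample (`max(1, round(...))`), which is a choice of this sketch for the rounding in line 6.

```python
import itertools
import random
from collections import deque

def reach_count(n_nodes, live_edges, s):
    """f_s(G): number of nodes reachable from s over live edges (s excluded)."""
    adj = [[] for _ in range(n_nodes)]
    for u, v in live_edges:
        adj[u].append(v)
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) - 1

def bss1(n_nodes, edges, s, n_samples, r, rng):
    """BSS-I sketch: stratify on the first r edges, allocate N_i = pi_i * N."""
    chosen, rest = edges[:r], edges[r:]
    estimate = 0.0
    for status in itertools.product((0, 1), repeat=r):  # 2^r strata
        # pi_i: probability that the chosen edges take exactly this status vector
        pi = 1.0
        for (_, _, p), x in zip(chosen, status):
            pi *= p if x else 1.0 - p
        if pi == 0.0:
            continue  # an impossible stratum contributes nothing
        n_i = max(1, round(pi * n_samples))
        fixed = [(u, v) for (u, v, _), x in zip(chosen, status) if x]
        total = 0
        for _ in range(n_i):
            # flip one coin for each of the remaining m - r edges
            live = fixed + [(u, v) for u, v, p in rest if rng.random() < p]
            total += reach_count(n_nodes, live, s)
        estimate += pi * total / n_i
    return estimate

# Hypothetical 4-node example graph.
example_edges = [(0, 1, 0.6), (0, 2, 0.3), (1, 3, 0.5), (2, 3, 0.8)]
print(bss1(4, example_edges, 0, 1000, 2, random.Random(42)))
```

Passing the edges in BFS visiting order from the seed would correspond to BSS-I-BFS, and shuffling them first to BSS-I-RM.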

4.3. Recursive stratified sampling estimator (I)

Recall that the BSS-I estimator splits the entire set of possible graphs into $2^r$ subsets,
which correspond to the top two layers of the enumeration tree (Fig. 2). Interestingly,
we observe that the basic stratified sampling (BSS-I) can be applied to any
internal node of the enumeration tree. Based on this observation, we develop a
recursive stratified sampling estimator, namely the RSS-I estimator, which is described in
Algorithm 4. The RSS-I estimator recursively partitions the sample size $N$ into $N_i = \pi_i N$
($i = 1, 2, \dots, 2^r$) for estimating the influenceability at stratum $i$ (lines 9-19). Note
that since the BSS-I estimator is unbiased, the RSS-I estimator is also unbiased. Moreover,
RSS-I reduces the variance at each partition, thus the variance of RSS-I is no larger
than the variance of BSS-I, as stated by the following theorem.

THEOREM 4.4. Let $Var(\hat{F}_{RSSI})$ be the variance of RSS-I, then $Var(\hat{F}_{RSSI}) \le Var(\hat{F}_{BSSI})$.

PROOF. We focus on the case in which RSS-I partitions the population only $2^r + 1$ times.
Similar arguments can be used to prove the case of more partitions. At the first partition,
RSS-I splits the population into $2^r$ strata, which is equivalent to BSS-I. In each
stratum $i$ ($i = 1, \dots, 2^r$), RSS-I recursively partitions it into $2^r$ sub-strata. Let $\Omega_i$, $\mu_i$,
$\sigma_i$ and $N_i$ be the probability space, the expectation, the variance, and the sample size
of stratum $i$ at the first partition, respectively. Let $\pi_i = \Pr[G_P \in \Omega_i]$ be the probability
of a sample in stratum $i$ as defined in Eq. (7). Similarly, for each stratum $i$,
we denote the probability space, the expectation, the variance, and the sample size of
sub-stratum $k$ ($k = 1, \dots, 2^r$) as $\Omega_{i,k}$, $\mu_{i,k}$, $\sigma_{i,k}$, and $N_{i,k}$, respectively. Further, we
denote the probability of a sample in sub-stratum $k$ as $\pi_{i,k}$, i.e., $\pi_{i,k} = \Pr[G_P \in \Omega_{i,k}]$.
Then, we have $\pi_{i,k} = \pi_i \omega_k$, where $\omega_k$ denotes the probability of a sample being in
sub-stratum $k$ conditioned on it being in stratum $i$, i.e., $\omega_k = \Pr[G_P \in \Omega_{i,k} \mid G_P \in \Omega_i]$.

The RSS-I estimator is given by $\hat{F}_{RSSI} = \sum_{i=1}^{2^r} \sum_{k=1}^{2^r} \pi_{i,k} \frac{1}{N_{i,k}} \sum_{j=1}^{N_{i,k}} f_s(G_{i,k,j})$, where
$G_{i,k,j}$ ($j = 1, \dots, N_{i,k}$) denotes a possible graph sampled from sub-stratum $k$ of
stratum $i$. Then, the variance of RSS-I is $Var(\hat{F}_{RSSI}) = \sum_{i=1}^{2^r} \sum_{k=1}^{2^r} \frac{\pi_{i,k}^2 \sigma_{i,k}}{N_{i,k}}$. By
our sample allocation strategy, we have $N_{i,k} = N \pi_{i,k}$, thereby the variance can be
simplified to $Var(\hat{F}_{RSSI}) = \sum_{i=1}^{2^r} \frac{1}{N} \sum_{k=1}^{2^r} \pi_{i,k} \sigma_{i,k}$. Further, by $\pi_{i,k} = \pi_i \omega_k$, we have
$Var(\hat{F}_{RSSI}) = \sum_{i=1}^{2^r} \frac{\pi_i}{N} \sum_{k=1}^{2^r} \omega_k \sigma_{i,k}$. By the proportional sample allocation, we have
$Var(\hat{F}_{BSSI}) = \sum_{i=1}^{2^r} \frac{\pi_i \sigma_i}{N}$. Therefore, the proof is completed once we show that
$\sum_{k=1}^{2^r} \omega_k \sigma_{i,k} \le \sigma_i$. By definition, we have

$$\sum_{k=1}^{2^r} \omega_k \sigma_{i,k} = \sum_{k=1}^{2^r} \omega_k \big( E(f_s(G_{i,k,j})^2) - \mu_{i,k}^2 \big)$$
$$= \sum_{k=1}^{2^r} \sum_{G_P \in \Omega_{i,k}} \frac{\Pr[G_P]}{\pi_i} f_s(G_P)^2 - \sum_{k=1}^{2^r} \omega_k \mu_{i,k}^2$$

Then, since $\sigma_i = \sum_{G_P \in \Omega_i} \frac{\Pr[G_P]}{\pi_i} f_s(G_P)^2 - \mu_i^2$ and $\mu_i = \sum_{k=1}^{2^r} \omega_k \mu_{i,k}$, we have

$$\sigma_i - \sum_{k=1}^{2^r} \omega_k \sigma_{i,k} = \sum_{k=1}^{2^r} \omega_k \mu_{i,k}^2 - \Big( \sum_{k=1}^{2^r} \omega_k \mu_{i,k} \Big)^2 \ge 0.$$

This completes the proof. □

The RSS-I algorithm terminates when the sample size becomes smaller than a given
threshold ($\tau$) or the number of unsampled edges becomes smaller than $r$ (line 2). When the
terminative conditions of the RSS-I algorithm are satisfied, we perform naive Monte-Carlo
sampling to estimate the influenceability (lines 3-7).

Similar to the BSS-I estimator, the partition approach in the RSS-I estimator also
depends on the edge-selection strategy (line 9). Likewise, we have two edge-selection
strategies for the RSS-I estimator, either random edge-selection or BFS edge-selection.
For convenience, we refer to the RSS-I estimator with random edge-selection and with
BFS edge-selection as the RSS-I-RM and RSS-I-BFS estimators, respectively.

Reconsider the example in Fig. 1(b). The BFS visiting order of the edges is
$\{v_5 \to v_3, v_5 \to v_6, v_3 \to v_1, v_3 \to v_4, v_6 \to v_2, v_1 \to v_2, v_1 \to v_3, v_1 \to v_4, v_4 \to v_6, v_2 \to v_6\}$.
Assume $r = 2$. According to the BFS visiting order, RSS-I-BFS first picks the
edges $v_5 \to v_3$ and $v_5 \to v_6$ for stratification, then selects the edges $v_3 \to v_1$ and
$v_3 \to v_4$, and so on. It is worth mentioning that we can invoke the procedure RSS-I
($\mathcal{G}$, $\emptyset$, $E$, $\emptyset$, $N$, $s$), where $s$ is the seed node, to calculate the RSS-I estimator.






ALGORITHM 4: RSS-I ($\mathcal{G}$, $E_1$, $E_2$, $X$, $N$, $s$)

Input: Influence network $\mathcal{G}$, the set of sampled edges $E_1$, the set of
       unsampled edges $E_2$, sample size $N$, and the seed node $s$.
Output: The RSS-I estimator $\hat{F}$.

 1: $\hat{F} \leftarrow 0$;
 2: if $N < \tau$ or $|E_2| < r$ then
 3:     for $j = 1$ to $N$ do
 4:         Flip $|E_2|$ coins to generate a possible graph $G_j$;
 5:         Compute $f_s(G_j)$ by the BFS algorithm;
 6:         $\hat{F} \leftarrow \hat{F} + f_s(G_j)$;
 7:     return $\hat{F}/N$;
 8: else
 9:     Select $r$ edges from $E_2$ according to an edge-selection strategy {Random or BFS visiting order};
10:     Let $T$ be the set of selected edges;
11:     for $i = 1$ to $2^r$ do
12:         $Y \leftarrow X$ {Recording the current status vector $X$};
13:         Let $X_i$ be the status vector of set $T$ in stratum $i$;
14:         Append $X_i$ to $Y$;
15:         Compute $\pi_i$ by Eq. (7);
16:         $N_i \leftarrow [\pi_i N]$;
17:         $\mu_i \leftarrow$ RSS-I ($\mathcal{G}$, $E_1 \cup T$, $E_2 \setminus T$, $Y$, $N_i$, $s$);
18:         $\hat{F} \leftarrow \hat{F} + \pi_i \mu_i$;
19:     return $\hat{F}$;



We analyze the time complexity of Algorithm 4. To sample a possible graph, Algorithm 4
needs to traverse the enumeration tree (Fig. 2) from the root node to a
terminative node. Here a terminative node is a node in the enumeration tree at which
the terminative conditions of the recursion are satisfied, i.e., $N < \tau$ or $|E_2| < r$
holds in Algorithm 4. Let $d$ be the average length of the path from the root node to
a terminative node. Then, by analysis, the time complexity of the algorithm at each
internal node of the path is $O(r)$. Suppose that the total number of such paths is $K$.
Then, the algorithm takes $O(Kdr)$ time at the internal nodes of all the
paths. Note that $K$ is bounded by the sample size $N$, and $d$ is very small
w.r.t. $N$. More specifically, we can derive that $d = O(\log_{2^r} N)$.
For example, assume $r = 5$ and $N = 100,000$; then we get $d \approx 3.3$. For
all the terminative nodes, the time complexity of the algorithm is $O(Nm)$. This is
because the algorithm needs to sample $N$ possible graphs in total over all the terminative
nodes, and for each possible graph the algorithm performs a BFS to compute the
influenceability, which takes $O(m)$ time. Since $O(Kdr)$ is dominated by $O(Nm)$,
the time complexity of Algorithm 4 is $O(Nm + Kdr) = O(Nm)$.
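The recursion of Algorithm 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: edges are `(u, v, p)` triples, the next `r` unsampled edges in list order are the ones selected at each level (the caller fixes random or BFS order), the set of sampled edges is represented only by the live edges decided so far, $f_s$ excludes the seed, `tau` stands for the threshold parameter, and each non-empty stratum gets at least one sample. The sketch requires `r >= 1` so that the recursion terminates.

```python
import itertools
import random
from collections import deque

def reach_count(n_nodes, live_edges, s):
    """f_s(G): number of nodes reachable from s over live edges (s excluded)."""
    adj = [[] for _ in range(n_nodes)]
    for u, v in live_edges:
        adj[u].append(v)
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) - 1

def rss1(n_nodes, fixed_live, unsampled, s, n, r, tau, rng):
    """RSS-I sketch; fixed_live holds the edges already decided to be live."""
    if n < tau or len(unsampled) < r:
        # Terminative node: naive Monte-Carlo over the unsampled edges.
        total = 0
        for _ in range(n):
            live = fixed_live + [(u, v) for u, v, p in unsampled if rng.random() < p]
            total += reach_count(n_nodes, live, s)
        return total / n
    chosen, rest = unsampled[:r], unsampled[r:]
    estimate = 0.0
    for status in itertools.product((0, 1), repeat=r):  # 2^r strata
        pi = 1.0
        for (_, _, p), x in zip(chosen, status):
            pi *= p if x else 1.0 - p
        if pi == 0.0:
            continue
        n_i = max(1, round(pi * n))  # proportional allocation, at least 1
        live = fixed_live + [(u, v) for (u, v, _), x in zip(chosen, status) if x]
        estimate += pi * rss1(n_nodes, live, rest, s, n_i, r, tau, rng)
    return estimate

# Hypothetical 4-node example graph; the top-level call starts with no fixed edges.
example_edges = [(0, 1, 0.6), (0, 2, 0.3), (1, 3, 0.5), (2, 3, 0.8)]
print(rss1(4, [], example_edges, 0, 1000, 2, 10, random.Random(42)))
```

Each recursive call copies the list of live edges instead of mutating it, which mirrors line 12 of Algorithm 4, where the current status vector is saved before a stratum's statuses are appended.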



5. NEW TYPE-II ESTIMATORS

In this section, we propose two new stratified sampling estimators, namely the type-II
basic stratified sampling (BSS-II) estimator and the type-II recursive stratified sampling
(RSS-II) estimator. BSS-II and RSS-II are shown to be unbiased, and their variances
are significantly smaller than the variance of the NMC estimator. In the following, we
first introduce the BSS-II estimator, and then present the RSS-II estimator.

Table III. Stratum design of the BSS-II/RSS-II estimator

5.1. Basic stratified sampling estimator (II)

Stratification: We propose a new stratification method for the BSS-II estimator. This
new stratification method splits the entire probability space $\Omega$ into $r + 1$ subspaces
($\Omega_0, \dots, \Omega_r$) by choosing $r$ edges. Specifically, for stratum 0, we set the statuses
of all the $r$ selected edges to "0", and for stratum $i$ ($i \ne 0$), we set the status of edge $i$
to "1" and the statuses of all the previous $i - 1$ edges (i.e., $e_1, \dots, e_{i-1}$) to "0". Unlike the
stratification method of the BSS-I estimator, this new stratification approach allows
us to set $r$ to be a big number, such as $r = 50$. The stratum design method is depicted
in Table III.

In Table III, each stratum (Stratum 0, Stratum 1, ..., Stratum $r$) corresponds to a
subspace ($\Omega_0, \Omega_1, \dots, \Omega_r$). For any $i \ne j$, we have $\Omega_i \cap \Omega_j = \emptyset$. Below, we show that
$\bigcup_{i=0}^{r} \Omega_i = \Omega$. Let $T = (e_1, e_2, \dots, e_r)$ be the set of $r$ selected edges and $p_i$ ($i = 1, \dots, r$) be the
corresponding influence probability, then the probability of a possible graph in stratum
$i$ is given by

$$\pi'_i = \Pr[G_P \in \Omega_i] = \begin{cases} \prod_{j=1}^{r} (1 - p_j), & \text{if } i = 0 \\ p_i \prod_{j=1}^{i-1} (1 - p_j), & \text{otherwise} \end{cases} \qquad (14)$$

The following theorem implies $\bigcup_{i=0}^{r} \Omega_i = \Omega$.

Theorem 5.1. $\Pr[G_P \in \Omega] = \sum_{i=0}^{r} \Pr[G_P \in \Omega_i] = 1$.

Proof. We prove it by the following equalities.

$$\sum_{i=0}^{r} \pi'_i = \prod_{j=1}^{r} (1 - p_j) + p_1 + (1 - p_1) p_2 + \cdots + p_r \prod_{j=1}^{r-1} (1 - p_j)$$
$$= \prod_{j=1}^{r-1} (1 - p_j) + p_1 + (1 - p_1) p_2 + \cdots + p_{r-1} \prod_{j=1}^{r-2} (1 - p_j)$$
$$= \cdots$$
$$= 1 - p_1 + p_1$$
$$= 1$$

□

Armed with Theorem 5.1, we conclude that the stratum design approach described
in Table III is a valid stratification method.

The BSS-II estimator: Similar to the BSS-I estimator, we let $N$ be the total sample
size, $N_i$ be the sample size of stratum $i$, and $G_{i,j}$ ($j = 1, 2, \dots, N_i$) be a possible
graph sampled from stratum $i$. Then the BSS-II estimator $\hat{F}_{BSSII}$ is given by

$$\hat{F}_{BSSII} = \sum_{i=0}^{r} \pi'_i \frac{1}{N_i} \sum_{j=1}^{N_i} f_s(G_{i,j}) \qquad (15)$$

where $\pi'_i$ is given in Eq. (14). Similar to Theorem 4.2, the following theorem shows that
the BSS-II estimator is unbiased. The proof is similar to the proof of Theorem 4.2, thus
we omit it for brevity.

Theorem 5.2. $F_s(\mathcal{G}) = E(\hat{F}_{BSSII})$.

The variance of the BSS-II estimator is given by

$$Var(\hat{F}_{BSSII}) = \sum_{i=0}^{r} \frac{\pi'^2_i \sigma_i}{N_i} \qquad (16)$$

where $\sigma_i$ denotes the variance of the sample in stratum $i$.

Sample allocation: Analogous to the BSS-I estimator, for the BSS-II estimator, we
can derive that the optimal sample allocation is given by $N_i = N \pi'_i \sqrt{\sigma_i} / \sum_{i=0}^{r} \pi'_i \sqrt{\sigma_i}$.
This optimal allocation strategy needs to know the variance of the sample in each
stratum, which is unavailable in our problem. Therefore, similar to the sample allocation
approach used in the BSS-I estimator, for the BSS-II estimator, we set the sample size
of stratum $i$ to $\pi'_i N$, i.e., $N_i = \pi'_i N$. On the basis of this sample allocation
method, we show that the variance of the BSS-II estimator is no larger than the variance
of the NMC estimator, as stated by the following theorem. The proof of the theorem is
similar to that of Theorem 4.3, thus we omit it for brevity.

THEOREM 5.3. If $N_i = \pi'_i N$, then $Var(\hat{F}_{BSSII}) \le Var(\hat{F}_{NMC})$.

However, it is very hard to compare the variance of the BSS-II estimator with the
variance of the BSS-I estimator. In our experiments, we find that these two estimators
achieve comparable variance.

The BSS-II algorithm: With the stratification and sample allocation methods, we
describe the BSS-II algorithm in Algorithm 5. Algorithm 5 picks $r$ edges to split the
entire population into $r + 1$ strata according to an edge-selection strategy (line 2). Either
of the two edge-selection strategies (random edge-selection and BFS edge-selection)
used in the BSS-I algorithm can also be used in the BSS-II algorithm. We refer to the
BSS-II estimator with random edge-selection and the BSS-II estimator with BFS
edge-selection as the BSS-II-RM and BSS-II-BFS estimators, respectively. Following the
sample allocation method of the BSS-II estimator, Algorithm 5 picks $N_i = \pi'_i N$ samples
from stratum $i$, for $i = 0, 1, \dots, r$, and outputs the BSS-II estimator $\hat{F}_{BSSII}$.
Like the BSS-I estimator, the time complexity of the BSS-II estimator is $O(Nm)$. This is
because BSS-II needs to draw $N$ possible graphs, and both sampling each possible
graph $G$ and computing $f_s(G)$ take $O(m)$ time.
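The type-II stratification and the BSS-II estimator can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `stratum_probs` evaluates Eq. (14), edges are `(u, v, p)` triples whose first `r` entries act as the selected edges $e_1, \dots, e_r$, $f_s$ excludes the seed, and each non-empty stratum receives at least one sample.

```python
import random
from collections import deque

def reach_count(n_nodes, live_edges, s):
    """f_s(G): number of nodes reachable from s over live edges (s excluded)."""
    adj = [[] for _ in range(n_nodes)]
    for u, v in live_edges:
        adj[u].append(v)
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) - 1

def stratum_probs(ps):
    """Eq. (14): returns [pi'_0, pi'_1, ..., pi'_r] for edge probabilities ps."""
    probs = []
    all_zero_so_far = 1.0  # probability that e_1..e_{i-1} are all "0"
    for p in ps:
        probs.append(p * all_zero_so_far)
        all_zero_so_far *= 1.0 - p
    return [all_zero_so_far] + probs

def bss2(n_nodes, edges, s, n_samples, r, rng):
    """BSS-II sketch: r + 1 strata, allocation N_i = pi'_i * N."""
    chosen, rest = edges[:r], edges[r:]
    estimate = 0.0
    for i, pi in enumerate(stratum_probs([p for _, _, p in chosen])):
        if pi == 0.0:
            continue
        n_i = max(1, round(pi * n_samples))
        if i == 0:
            fixed, free = [], rest        # all r selected edges are "0"
        else:
            u, v, _ = chosen[i - 1]
            fixed = [(u, v)]              # e_i is "1", e_1..e_{i-1} are "0"
            free = chosen[i:] + rest      # e_{i+1}..e_r remain undetermined
        total = 0
        for _ in range(n_i):
            live = fixed + [(a, b) for a, b, p in free if rng.random() < p]
            total += reach_count(n_nodes, live, s)
        estimate += pi * total / n_i
    return estimate

# Hypothetical 4-node example graph.
example_edges = [(0, 1, 0.6), (0, 2, 0.3), (1, 3, 0.5), (2, 3, 0.8)]
print(stratum_probs([0.6, 0.3]))
print(bss2(4, example_edges, 0, 1000, 2, random.Random(42)))
```

Because only $r + 1$ strata are created instead of $2^r$, the loop stays cheap even for large $r$, which is why this design tolerates values such as $r = 50$.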

5.2. Recursive stratified sampling estimator (II)

Based on the BSS-II estimator, in this subsection, we develop another new recursive
stratified sampling estimator, namely the RSS-II estimator. Similar to the idea of the RSS-I
estimator, the RSS-II estimator makes use of the BSS-II estimator as the basic
component and recursively applies the BSS-II estimator at each stratum. More specifically,
the RSS-II estimator first partitions the entire probability space $\Omega$ into $r + 1$ subspaces
$\Omega_i$ ($i = 0, 1, \dots, r$) according to the stratification method of the BSS-II estimator. The
same partition procedure is recursively performed in each subspace $\Omega_i$. At each partition,
the RSS-II estimator utilizes the same sample allocation method as the BSS-II
estimator to allocate the sample size. The recursion process of the RSS-II estimator
terminates when the sample size is smaller than a given threshold ($\tau$) or the number
of unsampled edges is smaller than $r$. Since the BSS-II estimator is unbiased, the
RSS-II estimator is also unbiased. The variance of the RSS-II estimator is no larger
than the variance of the BSS-II estimator, because the RSS-II estimator recursively
reduces variance at each partition while the BSS-II estimator only reduces variance
at one partition. Similar to Theorem 4.4, we have the following theorem.

THEOREM 5.4. Let $Var(\hat{F}_{RSSII})$ be the variance of RSS-II, then $Var(\hat{F}_{RSSII}) \le Var(\hat{F}_{BSSII})$.

ALGORITHM 5: BSS-II ($\mathcal{G}$, $N$, $s$)

Input: Influence network $\mathcal{G}$, sample size $N$, and the seed node $s$.
Output: The BSS-II estimator $\hat{F}$.

 1: $\hat{F} \leftarrow 0$;
 2: Select $r$ edges according to an edge-selection strategy;
 3: for $i = 0$ to $r$ do
 4:     Compute $\pi'_i$ by Eq. (14);
 5:     $N_i \leftarrow [\pi'_i N]$;
 6:     $t \leftarrow 0$;
 7:     if $i = 0$ then
 8:         $k \leftarrow r$;
 9:     else
10:         $k \leftarrow i$;
11:     Let $E_i$ be the set of edges to be determined under stratum $i$;
12:     for $j = 1$ to $N_i$ do
13:         Flip $m - k$ coins to determine $E_i$, and thus generate a possible graph $G_j$;
14:         Compute $f_s(G_j)$ by the BFS algorithm;
15:         $t \leftarrow t + f_s(G_j)$;
16:     $t \leftarrow t / N_i$;
17:     $\hat{F} \leftarrow \hat{F} + \pi'_i t$;
18: return $\hat{F}$;

The detailed algorithm of the RSS-II estimator is described in Algorithm 6. First,
according to an edge-selection strategy, Algorithm 6 selects $r$ edges from the unsampled
edge set, which is denoted by $E_2$, to partition the population into $r + 1$ strata (line 9).
Note that the random edge-selection and BFS edge-selection strategies used in the RSS-I
estimator can also be applied in the RSS-II estimator. We refer to the RSS-II estimator
with random edge-selection and BFS edge-selection as the RSS-II-RM and RSS-II-BFS
estimators, respectively. Second, according to the sample allocation method, the
algorithm recursively invokes the RSS-II algorithm with sample size $N_i$ in stratum $i$,
for $i = 0, 1, \dots, r$ (lines 11-23). In line 15 and line 20, we let $X_i$ be the status vector of
the selected edges under stratum $i$. Unlike the RSS-I estimator, the status vector
of the RSS-II estimator is determined by the stratification method of the BSS-II
estimator (Table III). For example, at the first partition of the RSS-II estimator, assume
$T = (e_1, e_2, \dots, e_r)$ is the set of $r$ edges selected; the status vector of these selected
edges under stratum 0 is $X_0 = (0, 0, \dots, 0)$. The status vector under stratum $i$
is $X_i = (0, \dots, 0, 1, *, \dots, *)$, where the statuses of the first $i - 1$ edges are "0", the status
of the $i$-th edge is "1", and the remaining $r - i$ edges are "*", i.e., not yet determined. Finally,
the algorithm outputs the RSS-II estimator (line 24).

ALGORITHM 6: RSS-II ($\mathcal{G}$, $E_1$, $E_2$, $X$, $N$, $s$)

Input: Influence network $\mathcal{G}$, the set of sampled edges $E_1$,
       the set of unsampled edges $E_2$, sample size $N$,
       and the seed node $s$.
Output: The RSS-II estimator $\hat{F}$.

 1: $\hat{F} \leftarrow 0$;
 2: if $N < \tau$ or $|E_2| < r$ then
 3:     for $j = 1$ to $N$ do
 4:         Flip $|E_2|$ coins to generate a possible graph $G_j$;
 5:         Compute $f_s(G_j)$ by the BFS algorithm;
 6:         $\hat{F} \leftarrow \hat{F} + f_s(G_j)$;
 7:     return $\hat{F}/N$;
 8: else
 9:     Select $r$ edges from $E_2$ according to an edge-selection strategy (random or BFS visiting order);
10:     Let $T = (e_1, e_2, \dots, e_r)$ be the set of selected edges;
11:     for $i = 0$ to $r$ do
12:         Compute $\pi'_i$ by Eq. (14);
13:         $N_i \leftarrow [\pi'_i N]$;
14:         if $i = 0$ then
15:             Let $X_0$ be the status vector of set $T$ under stratum 0;
16:             Append $X_0$ to $X$;
17:             $\mu_i \leftarrow$ RSS-II ($\mathcal{G}$, $E_1 \cup T$, $E_2 \setminus T$, $X$, $N_i$, $s$);
18:         else
19:             Let $T_i \leftarrow \{e_1, \dots, e_i\}$;
20:             Let $X_i$ be the status vector of set $T_i$ under stratum $i$;
21:             Append $X_i$ to $X$;
22:             $\mu_i \leftarrow$ RSS-II ($\mathcal{G}$, $E_1 \cup T_i$, $E_2 \setminus T_i$, $X$, $N_i$, $s$);
23:         $\hat{F} \leftarrow \hat{F} + \pi'_i \mu_i$;
24:     return $\hat{F}$;

Like the RSS-I estimator, to sample a possible graph, the RSS-II algorithm needs
to traverse the recursion tree from the root node to a terminative node. At all the
terminative nodes, the algorithm needs to sample $N$ possible graphs in total, and for
each possible graph it needs to perform a BFS to compute the influenceability, thus
the time complexity is $O(Nm)$. At each internal node on a path from the root node to
a terminative node, the time complexity is $O(r)$. This is because at each internal
node the algorithm only needs to select $r$ edges and determine their statuses, which
takes $O(r)$ time. Let $d$ be the average length of such a path and $K$ be the
total number of paths. Then, for all the internal nodes, the algorithm takes $O(Kdr)$
time. According to the terminative condition given in Algorithm 6, we can
derive that $d = \min\{\log_r N, \log_r m\}$. Since $r$ can be a big number (e.g., $r = 50$), $d$ is very
small. Thus, the time complexity at the internal nodes, $O(Kdr)$, is dominated by
$O(Nm)$. We conclude that the average time complexity of Algorithm 6 is $O(Nm)$.
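The depth estimates used in the two complexity analyses are easy to check numerically. The short sketch below evaluates $\log_{2^r} N$ for RSS-I and $\min\{\log_r N, \log_r m\}$ for RSS-II; the function names are hypothetical, and the RSS-II inputs ($N = 1,000$, $m = 95,188$, $r = 50$) are taken from the experimental settings and the Condmat edge count only as an illustration.

```python
from math import log

def rss1_depth(n, r):
    """Depth estimate log_{2^r}(N) for the RSS-I recursion."""
    return log(n) / log(2 ** r)

def rss2_depth(n, m, r):
    """Depth estimate min(log_r N, log_r m) for the RSS-II recursion."""
    return min(log(n, r), log(m, r))

# r = 5 and N = 100,000, the example used in the RSS-I analysis.
print(round(rss1_depth(100_000, 5), 1))  # 3.3

# Illustrative RSS-II setting: N = 1,000 samples, m = 95,188 edges, r = 50.
print(rss2_depth(1_000, 95_188, 50))
```

In both cases the depth is a small constant compared with $N$ and $m$, which is consistent with the claim that the $O(Kdr)$ term is dominated by $O(Nm)$.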

6. EXPERIMENTS 

We conduct experimental studies for different estimators over four datasets. We con- 
firm the efficiency and accuracy of the proposed estimators. In the following, we first 
describe the experimental setup, and then report our results. 



6.1. Experimental setup 

Datasets: We use one synthetic dataset and three real datasets in our experiments. We
apply the same parameters used in [Jin et al. 2011b] to generate the synthetic dataset.
For the graph topology, we generate an Erdős-Rényi (ER) random graph with 5,000
vertices and edge density 10. For the influence probabilities, we generate a probability
for each edge according to a [0, 1] uniform distribution.

The three real datasets are given as follows. (1) FacebookLike dataset: this dataset
originates from a Facebook social network for students at University of California,
Irvine. It contains the users who sent or received at least one message. We collect
this dataset from (toreopsahl.com/datasets). The dataset is a weighted graph, and
the weight of each edge denotes the number of messages passing over the edge. (2)
Condmat dataset: this dataset is a weighted collaboration network, where the weight
of an edge represents the number of co-authored papers between two collaborators.
We download this dataset from (www-personal.umich.edu/~mejn/netdata). (3) DBLP
dataset: this dataset is also a weighted collaboration network, where the weight of
an edge signifies the number of co-authored papers. This dataset is provided by the
authors of [Zhou et al. 2010]. Table IV summarizes the information for the four
datasets. To obtain the influence networks, for each real dataset, we generate the
influence probabilities according to the same method used in [Potamias et al. 2010;
Jin et al. 2011b]. Specifically, to generate the probability of an edge, we apply an
exponential cumulative distribution function (CDF) with mean 2 to the weight of the
edge.

Table IV. Summary of the datasets

Name            Nodes     Edges      Ref.
Random graph    5,000     50,616     [Jin et al. 2011b]
FacebookLike    1,899     20,296     [Opsahl and Panzarasa 2009]
Condmat         16,264    95,188     [Newman 2001]
DBLP            78,648    376,515    [Zhou et al. 2010]
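The weight-to-probability mapping described above can be sketched as follows, assuming the standard exponential CDF $1 - e^{-w/\text{mean}}$ evaluated at the edge weight $w$. The function name `influence_probability` is hypothetical, and the exact preprocessing used by the authors may differ.

```python
from math import exp

def influence_probability(weight, mean=2.0):
    """Exponential CDF with the given mean, evaluated at the edge weight."""
    return 1.0 - exp(-weight / mean)

# Heavier edges (more messages or co-authored papers) get higher probabilities.
for w in (1, 2, 5, 10):
    print(w, influence_probability(w))
```

Under this mapping an edge of weight 2 receives probability $1 - e^{-1}$, and the probability approaches 1 as the weight grows.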

Different estimators: In our experiments, we compare 10 estimators. (1) The NMC
estimator, which is the Naive Monte-Carlo estimator. (2) RSS-I-RM ($r = 1$), which
is a special RSS-I-RM estimator with the parameter $r = 1$, based on the work
presented in [Jin et al. 2011b] for computing distance-constraint reachability on
uncertain graphs. We also generalize their estimator to an arbitrary parameter $r$, and apply the
generalized estimator to influenceability evaluation. Recall that beyond the random
edge-selection strategy, we propose a more accurate RSS-I estimator with the BFS
edge-selection strategy. (3) BSS-I-RM, which is the BSS-I estimator with random
edge-selection. (4) BSS-I-BFS, which is the BSS-I estimator with BFS edge-selection.
(5) RSS-I-RM, which is the RSS-I estimator with random edge-selection. (6) RSS-I-BFS,
which is the RSS-I estimator with BFS edge-selection. (7) BSS-II-RM, which
is the BSS-II estimator with random edge-selection. (8) BSS-II-BFS, which is the
BSS-II estimator with BFS edge-selection. (9) RSS-II-RM, which is the RSS-II
estimator with random edge-selection. (10) RSS-II-BFS, which is the RSS-II estimator
with BFS edge-selection.

Evaluation metric: Two metrics are used to evaluate the performance of the
estimators: running time and relative variance. The running time evaluates the efficiency
of the estimators. The relative variance evaluates the accuracy of the
estimators. Let $\sigma_{NMC}$ be the variance of the NMC estimator. We calculate the relative
variance of an estimator $\hat{F}$ by $\sigma_{\hat{F}} / \sigma_{NMC}$. Since computing the exact variance of the
estimators is intractable, we resort to an unbiased estimator of the variance. A similar
evaluation metric has been used in [Jin et al. 2011b]. Specifically, for a given seed node
$s$ in our experiments, we run each estimator $\hat{F}_s(\mathcal{G})$ 500 times, thereby obtaining
500 estimates: $\hat{F}_s^{(1)}(\mathcal{G}), \hat{F}_s^{(2)}(\mathcal{G}), \dots, \hat{F}_s^{(500)}(\mathcal{G})$. An unbiased variance estimator
of $\hat{F}_s(\mathcal{G})$ is given by

$$\sum_{i=1}^{500} \big( \hat{F}_s^{(i)}(\mathcal{G}) - \bar{F}_s(\mathcal{G}) \big)^2 / 499,$$

where $\bar{F}_s(\mathcal{G})$ denotes the mean of the 500 estimates.
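The evaluation metric can be sketched in Python as follows. This is a minimal illustration: `statistics.variance` computes the sample variance with the $n - 1$ denominator, which matches the unbiased estimator above, and the two lists of 500 "runs" are synthetic Gaussian placeholders standing in for repeated estimator outputs, not experimental data.

```python
import random
import statistics

def unbiased_variance(estimates):
    """Sample variance with the n - 1 denominator."""
    return statistics.variance(estimates)

def relative_variance(estimates, nmc_estimates):
    """Variance of an estimator's repeated runs divided by that of NMC."""
    return unbiased_variance(estimates) / unbiased_variance(nmc_estimates)

# Placeholder data: 500 repeated runs per estimator for one seed node.
rng = random.Random(7)
nmc_runs = [rng.gauss(10.0, 2.0) for _ in range(500)]
rss_runs = [rng.gauss(10.0, 1.0) for _ in range(500)]
print(relative_variance(rss_runs, nmc_runs))
```

A relative variance below 1 means the estimator is more stable than NMC at the same sample size; by construction, NMC itself has relative variance 1.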

Parameter settings and the experimental environment: Unless otherwise
stated, we set the parameters as follows in all of our experiments. For all estimators,
we set the sample size $N = 1,000$. For the BSS-I and RSS-I estimators, we set $r = 5$,
and for the BSS-II and RSS-II estimators, we set $r = 50$. For the threshold parameter
$\tau$ in Algorithm 4 and Algorithm 6, we set $\tau = 10$. All the experiments are conducted
on a Scientific Linux 6.0 workstation with 2x Quad-Core Intel(R) 2.66 GHz CPUs and
4 GB of memory. All algorithms are implemented and compiled with GCC 4.4.4.

Table V. Results on random graph dataset

Estimators          Relative variance    Running time (s)
NMC                 1.0000               0.3593
RSS-I-RM (r = 1)    0.6723               0.3558
BSS-I-RM            0.9429               0.3497
BSS-I-BFS           0.8938               0.3748
RSS-I-RM            0.3397               0.3373
RSS-I-BFS           0.2056               0.3783
BSS-II-RM           0.9321               0.3633
BSS-II-BFS          0.9042               0.3749
RSS-II-RM           0.3512               0.3716
RSS-II-BFS          0.2063               0.3847

6.2. Experimental Results 

For all the experiments, we randomly generate 1,000 seed nodes, and the results are
averaged over all the seeds. We report our experimental results on the random
graph, FacebookLike, Condmat, and DBLP datasets in Table V, Table VI, Table VII,
and Table VIII, respectively.

From Table V, we can observe that, among all the estimators, RSS-I-BFS is the
winner on the random graph dataset, and that the RSS-I-RM, RSS-II-RM, and RSS-II-BFS
estimators are significantly better than the RSS-I-RM ($r = 1$) estimator. The specific
RSS-I-RM ($r = 1$) estimator outperforms the BSS estimators, and all the BSS estimators
are better than the NMC estimator. In particular, RSS-I-BFS reduces the relative
variance over the NMC and RSS-I-RM ($r = 1$) estimators by 386% and 227%, respectively.
RSS-II-BFS cuts the relative variance over NMC and RSS-I-RM ($r = 1$) by 385% and
226%, respectively. Both the RSS-I-RM and RSS-II-RM estimators cut the relative
variance over the NMC and the RSS-I-RM ($r = 1$) estimators by roughly 185% and 91.4% or more,
respectively. The BSS estimators perform worse than the RSS-I-RM
($r = 1$) estimator, but better than the NMC estimator. In addition,
the running times of all the estimators are comparable. These results are consistent with our
analysis in Section 4 and Section 5.
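The percentages quoted for the random graph dataset can be reproduced from the relative variances in Table V. The sketch below assumes that "reduces the relative variance over X by p%" means $(\sigma_X - \sigma_{new}) / \sigma_{new}$, a convention inferred here from the reported numbers rather than stated in the text; the function name is hypothetical.

```python
def reduction_percent(reference_rv, improved_rv):
    """Reduction of improved_rv relative to reference_rv, in percent."""
    return 100.0 * (reference_rv - improved_rv) / improved_rv

# Relative variances copied from Table V (random graph dataset).
table_v = {
    "NMC": 1.0000,
    "RSS-I-RM (r = 1)": 0.6723,
    "RSS-I-BFS": 0.2056,
    "RSS-II-BFS": 0.2063,
}

for new in ("RSS-I-BFS", "RSS-II-BFS"):
    for ref in ("NMC", "RSS-I-RM (r = 1)"):
        print(new, "vs", ref, round(reduction_percent(table_v[ref], table_v[new])))
```

Under this convention, a reduction of 386% corresponds to a variance roughly 4.9 times smaller than that of NMC.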

From Table VI, we can see that RSS-II-BFS achieves the best relative variance on
the FacebookLike dataset, followed by RSS-I-BFS, RSS-II-RM, RSS-I-RM, RSS-I-RM
($r = 1$), the BSS estimators, and the NMC estimator. More specifically, the RSS-II-BFS
estimator reduces the relative variance over the NMC estimator and the RSS-I-RM
($r = 1$) estimator by 317% and 133%, respectively. The RSS-I-BFS estimator reduces
the relative variance over NMC and RSS-I-RM ($r = 1$) by 289% and 117%, respectively.
Both the RSS-I-RM and RSS-II-RM estimators cut the relative variance over NMC and RSS-I-RM
($r = 1$) by more than 231% and 84%, respectively. Similar to the result on the random
graph dataset, all the BSS estimators are worse than the RSS-I-RM ($r = 1$)
estimator but better than the NMC estimator. Also, the running times
of all the estimators are comparable because the time complexities of all the estimators
are $O(Nm)$. These results confirm our analysis in the previous sections. Similar results
can be observed on the Condmat (Table VII) and DBLP (Table VIII) datasets.

To summarize, RSS-I-BFS and RSS-II-BFS achieve the best relative variance, and
they reduce the relative variance of the existing estimators by a factor of several. The RSS
estimators are better than the BSS estimators. The BSS/RSS estimators with the BFS
edge-selection strategy are better than the BSS/RSS estimators with the random
edge-selection strategy. All of our RSS estimators outperform the RSS-I-RM ($r = 1$)
estimator. The proposed BSS estimators are worse than the RSS-I-RM ($r = 1$)
estimator, but still outperform the NMC estimator. The running times of all
the estimators are comparable.

Table VI. Results on FacebookLike dataset

Estimators          Relative variance    Running time (s)
NMC                 1.0000               0.2007
RSS-I-RM (r = 1)    0.5585               0.2014
BSS-I-RM            0.8898               0.2331
BSS-I-BFS           0.6819               0.2354
RSS-I-RM            0.3023               0.2002
RSS-I-BFS           0.2570               0.2010
BSS-II-RM           0.6947               0.2250
BSS-II-BFS          0.6672               0.2284
RSS-II-RM           0.2786               0.2027
RSS-II-BFS          0.2397               0.2037

Table VII. Results on Condmat dataset

Estimators          Relative variance    Running time (s)
NMC                 1.0000               1.2969
RSS-I-RM (r = 1)    0.7950               1.2958
BSS-I-RM            0.9068               1.3043
BSS-I-BFS           0.8531               1.3054
RSS-I-RM            0.4883               1.2050
RSS-I-BFS           0.1971               1.2411
BSS-II-RM           0.8553               1.2513
BSS-II-BFS          0.8421               1.3104
RSS-II-RM           0.4891               1.2256
RSS-II-BFS          0.2120               1.2284

Table VIII. Results on DBLP dataset

Estimators          Relative variance    Running time (s)
NMC                 1.0000               8.5824
RSS-I-RM (r = 1)    0.5375               8.6536
BSS-I-RM            0.9170               8.6292
BSS-I-BFS           0.8373               8.8173
RSS-I-RM            0.2100               8.3835
RSS-I-BFS           0.1918               8.5933
BSS-II-RM           0.9449               8.8825
BSS-II-BFS          0.7997               9.1305
RSS-II-RM           0.2003               8.6840
RSS-II-BFS          0.1821               8.7052

Scalability: In order to study the scalability of various estimators, we generate syn- 
thetic probabilistic graphs Q with nodes ranging from 200,000 (200k) to 800,000 and 
the edges ranging from 800,000 to 3,200,000 (3.2m) according to the ER random graph 
model. And the probability of each edge is randomly generated according to a [0, 1] uni- 
form distribution. Also, for each estimator, we set the sample size TV to 1,000. Table HXl 
shows the running time of different estimators on four large synthetic probabilistic 
graphs. As can be seen in Table HXl the running time increases as the size of the graph 
increases. In general, all the estimators achieve comparable running time, and they 
have linear growth w.r.t. the graph size. These results consist with the complexities of 
our estimators, i.e. O(Nm). 

Effect of parameter r: We study the effect of the parameter r on our proposed
estimators on the Condmat dataset. Similar results can be observed on the other
datasets. Fig. 3 and Fig. 4 show the relative variance of our type-I and type-II
estimators w.r.t.

Table IX. Scalability: Running time on synthetic graphs. Here the two numbers in the
2nd-5th columns (e.g., 200k/800k) indicate the numbers of nodes and edges,
respectively.

Time (s)      200k/800k   400k/1.3m   600k/1.6m   800k/3.2m
NMC           26.0820     156.9600    289.7720    365.0280
BSS-I-RM      25.2090     159.1990    281.6810    343.0350
BSS-I-BFS     27.2120     169.6120    286.2180    368.0910
RSS-I-RM      23.3430     143.6700    264.9790    342.3920
RSS-I-BFS     25.2090     169.6120    286.2180    344.0180
BSS-II-RM     26.1450     161.4100    287.1500    371.4770
BSS-II-BFS    29.5760     162.3930    290.9340    374.6830
RSS-II-RM     26.4440     156.8120    270.6670    363.1590
RSS-II-BFS    26.4990     162.7940    271.1370    365.9630


Fig. 3. Effect of r of BSS-I/RSS-I estimators. 



various r. As can be seen in Fig. 3, the BSS-I estimators exhibit similar relative
variance over different r values. However, the relative variance of the RSS-I-RM
estimator decreases as r increases when r < 5, and increases with r otherwise. For the
RSS-I-BFS estimator, the relative variance decreases as r increases, and when r > 5
the rate of decrease is very small and the curve flattens. Based on this observation,
r = 5 is the best choice, and it is the value used in the previous experiments. For
the type-II estimators, we test the parameter r from 10 to 70, and the results (Fig. 4)
show that all of our type-II estimators except RSS-II-BFS are not very sensitive to
the parameter r. As an exception, the relative variance of the RSS-II-BFS estimator
decreases as r increases when r < 50, and when r > 50 the curve flattens. Therefore,
r = 50 is a good choice, and we set r to 50 in our previous experiments. Table X and
Table XI report the running time of the type-I and type-II estimators under different
r values. We can see that the running times of both the type-I and the type-II
estimators are comparable.

Effect of sample size: As shown in the previous experiments, the RSS-I-BFS and the
RSS-II-BFS estimators are the best two estimators. Here we study how the sample size
affects the estimation accuracy of these two estimators on the Condmat dataset.
Similar results can be observed on the other datasets. Fig. 5 shows the relative
variance of the estimators under various sample sizes. As can be observed in Fig. 5,
the curves

Fig. 4. Effect of r of BSS-II/RSS-II estimators.



Table X. BSS-I/RSS-I estimators: Running time vs r

Time (s)      r = 1     r = 2     r = 3     r = 5     r = 10
BSS-I-RM      1.2642    1.2705    1.2770    1.3043    1.2986
BSS-I-BFS     1.2731    1.2791    1.2798    1.3054    1.3686
RSS-I-RM      1.2082    1.1993    1.1810    1.2050    1.1172
RSS-I-BFS     1.2158    1.2140    1.2189    1.2411    1.1833

Table XI. BSS-II/RSS-II estimators: Running time vs r

Time (s)      r = 10    r = 20    r = 30    r = 50    r = 70
BSS-II-RM     1.2515    1.2502    1.2511    1.2513    1.2524
BSS-II-BFS    1.2579    1.2719    1.2836    1.3104    1.3447
RSS-II-RM     1.2279    1.2246    1.2162    1.2256    1.2092
RSS-II-BFS    1.2358    1.2278    1.2258    1.2284    1.2295






of the RSS-I-BFS and RSS-II-BFS estimators are very smooth, which indicates that the
relative variance of both the RSS-I-BFS and RSS-II-BFS estimators is robust w.r.t. the
sample size.

7. RELATED WORK 

After the seminal work by Kempe et al. [Kempe et al. 2003], influence maximization in
social networks has attracted much attention in the data mining and social network
analysis research communities [Leskovec et al. 2007; Chen et al. 2009; Chen et al.
2010; Chen et al. 2011; Goyal et al. 2011]. A crucial subroutine in influence
maximization is influence function evaluation, to which the influenceability
estimation problem presented in this paper is closely related. In the following, we
first review some notable work on the influence maximization problem, and then discuss
the existing work on influence function evaluation. In [Leskovec et al. 2007], the
authors study the influence maximization problem in the context of water distribution
and blogosphere monitoring. They propose the so-called CELF framework for optimizing
influence maximization algorithms. To further accelerate influence maximization
algorithms, Chen et al. [Chen et al. 2009] propose a scalable algorithm that samples N
possible graphs and estimates the influence spread of all vertices on each possible
graph at one time. Subsequently, the same authors propose a series of scalable
algorithms in [Chen et al. 2010] and [Chen et al. 2011] for influence maximization by
developing heuristic vertex-selection strategies on unsigned and signed networks,
respectively. Recently, Goyal et al. [Goyal et al. 2010; Goyal et al. 2011] consider
the problem of learning the influence probabilities, and study influence maximization
from a data-driven perspective. Note that all the mentioned methods focus on the
influence maximization problem. For the influence function evaluation problem, Kempe
et al. first pose it as an open problem in [Kempe et al. 2005]. Then, Chen et al.
[Chen et al. 2010] show that the influence function evaluation problem is #P-complete.
Given the hardness of the problem, most of the existing work on this problem, such as
[Kempe et al. 2003; Leskovec et al. 2007; Chen et al. 2009; Chen et al. 2010], is
based on Naive Monte-Carlo (NMC) sampling. In this paper, we study the
influenceability evaluation problem and develop more accurate RSS estimators for
estimating the influenceability; our algorithms can also be used for influence
function evaluation.

Our work is also related to uncertain graph mining. Recently, uncertain graph mining
has attracted increasing interest because of its growing number of applications in
biological databases [Sevon et al. 2006], network routing [Ghosh et al. 2007], and
influence networks [Goyal et al. 2011]. A large body of work has been proposed in the
literature. Notable work includes finding reliable subgraphs in a large uncertain
graph [Hintsanen and Toivonen 2008; Jin et al. 2011a], frequent subgraph mining in
uncertain graph databases [Zou et al. 2010a; Zou et al. 2010b], subgraph search in
large uncertain graphs [Yuan et al. 2011], K-nearest neighbor search in uncertain
graphs [Potamias et al. 2010], and distance-constraint reachability computation in
uncertain graphs [Jin et al. 2011b]. In general, all the mentioned uncertain graph
mining problems are shown to be #P-complete, and thereby finding the exact solution is
intractable in large uncertain graphs. Consequently, most existing work, such as
[Potamias et al. 2010] and [Jin et al. 2011a], is based on NMC sampling. The NMC
sampling based methods lead to a large variance, and thus reduce the performance of
the algorithms. Recently, Jin et al. [Jin et al. 2011b] propose a recursive stratified
sampling method for distance-constraint reachability computation on uncertain graphs,
although they do not describe their method as stratified sampling. It is important to
note that their method is a very special case of our RSS-I algorithm: they select only
one edge for stratification at a time, and then recursively perform this procedure.
Unlike their algorithm, first, we develop a generalized algorithm (RSS-I) that selects
r edges for stratification. Second, unlike their reachability problem, here we study
the influenceability evaluation problem using RSS-I sampling. Moreover, in our work,
we also develop another RSS estimator, i.e., the RSS-II estimator. Note that all of
our RSS estimators can also be applied to the distance-constraint reachability
computation problem.
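To make the contrast between stratifying on one edge and on r edges concrete, the sketch below enumerates the strata and their probabilities for a set of chosen edges. It is a hedged illustration only: the function names are hypothetical, and the layout of the r + 1 type-II strata (stratum 0 has all r edges absent; stratum i has the first i - 1 edges absent and edge i present) is an assumption made here for the example, since only the stratum counts 2^r and r + 1 are stated in the text. With r = 1 the type-I construction reduces to the two-way split on a single edge.

```python
from itertools import product


def type1_strata(edge_probs):
    """All 2^r present/absent patterns of the r chosen edges, with their probabilities."""
    strata = []
    for status in product((0, 1), repeat=len(edge_probs)):
        pi = 1.0
        for p, present in zip(edge_probs, status):
            pi *= p if present else 1.0 - p
        strata.append((status, pi))
    return strata


def type2_strata(edge_probs):
    """r + 1 strata (assumed layout, see the lead-in); returns (stratum index, probability)."""
    strata = []
    all_absent = 1.0  # probability that every edge seen so far is absent
    for i, p in enumerate(edge_probs, start=1):
        strata.append((i, all_absent * p))
        all_absent *= 1.0 - p
    return [(0, all_absent)] + strata


def stratified_estimate(stratum_probs, stratum_means):
    """Standard stratified estimator: probability-weighted sum of per-stratum sample means."""
    return sum(pi * mean for pi, mean in zip(stratum_probs, stratum_means))
```

In both constructions the stratum probabilities sum to one, so weighting each stratum's sample mean by its probability gives an unbiased overall estimate; a recursive variant would apply the same split again inside each stratum.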

In addition, our work is related to the network reliability estimation problem, where
a network is modeled as an uncertain graph and the goal is to estimate some
reliability metrics of the network [Fishman 1986a; Rubino 1999]. There has been much
work on this topic over the last five decades. Surveys can be found in [Colbourn 1987;
Rubino 1999].

Below, we review the Monte-Carlo algorithms for network reliability estimation.
Kumamoto et al. [Kumamoto et al. 1977] propose an efficient Monte-Carlo algorithm by
exploiting bounds on the reliability metric. Fishman [Fishman 1986b] proposes a more
general Monte-Carlo algorithm based on such bounding techniques for reliability
estimation. Subsequently, Fishman [Fishman 1986a] compares four Monte-Carlo algorithms
for the network reliability estimation problem. Cancela and Khadiri [Cancela and
Khadiri 2003] propose a recursive variance-reduction algorithm for network reliability
estimation. Note that all the mentioned Monte-Carlo algorithms are tailored to the
network reliability estimation problem, and the reliability measure is typically a
Boolean metric; thus they cannot be used in our problem.

8. CONCLUSIONS 

In this paper, we focus on the influenceability evaluation problem, which is a
fundamental issue for influence analysis in social networks. This problem is known to
be #P-complete, and the only existing algorithm is based on Naive Monte-Carlo (NMC)
sampling. To reduce the variance of the NMC estimator, we propose two basic stratified
sampling (BSS) estimators. Furthermore, based on our BSS estimators, we present two
recursive stratified sampling (RSS) estimators. We conduct comprehensive experiments
on one synthetic and three real datasets, and the results confirm that our RSS
estimators reduce the variance of the NMC estimator several-fold. There are several
future directions that deserve further investigation. First, most of our estimators,
except the RSS estimators with BFS edge selection, do not take the graph structural
information into account. In our experiments, the RSS estimators with BFS edge
selection perform much better than the RSS estimators with random edge selection. A
promising direction is to exploit the graph structural information to develop more
efficient and more accurate estimators for influenceability evaluation. Second, our
estimation techniques are quite general. For many uncertain graph mining problems,
such as shortest path [Potamias et al. 2010], reachability [Jin et al. 2011b], and
reliable subgraph discovery [Jin et al. 2011a], our estimators can be directly used.
For these problems, we only need to replace $\phi_s(G_P)$ with other quantities, such
as the length of the shortest path, the reachability function between two nodes, and
the reliable subgraph metric. Most existing methods for these uncertain graph mining
problems are based on NMC. Another promising future direction is to apply our
estimation techniques to these problems.
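The idea of swapping the evaluated quantity can be illustrated with a small sketch. Everything here is hypothetical and for illustration only: the function names are invented, the graph is assumed to be directed, and the sampler shown is plain Monte-Carlo over possible graphs rather than any of the BSS or RSS estimators. The point is only that the sampling loop is independent of the per-graph quantity, which is passed in as a function.

```python
import random
from collections import deque


def sample_possible_graph(n, prob_edges, rng):
    """Draw one possible graph: keep each edge (u, v, p) independently with probability p."""
    adj = [[] for _ in range(n)]
    for u, v, p in prob_edges:
        if rng.random() < p:
            adj[u].append(v)
    return adj


def bfs_hops(adj, s):
    """Hop distance from s to every node reachable in a deterministic graph."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def influence(adj, s):
    """Number of nodes reached from s (source excluded, an assumed convention)."""
    return len(bfs_hops(adj, s)) - 1


def reachability(adj, s, t):
    """1.0 if t is reachable from s in this possible graph, else 0.0."""
    return 1.0 if t in bfs_hops(adj, s) else 0.0


def estimate(n, prob_edges, quantity, num_samples, seed=0):
    """Average a per-graph quantity over sampled possible graphs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += quantity(sample_possible_graph(n, prob_edges, rng))
    return total / num_samples
```

For example, `estimate(n, edges, lambda g: influence(g, 0), 1000)` and `estimate(n, edges, lambda g: reachability(g, 0, 3), 1000)` share the same sampling loop and differ only in the quantity evaluated on each possible graph.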